Scientific Reports. 2026 Jan 29;16:6572. doi: 10.1038/s41598-026-37763-w

A lightweight hybrid perception enhancement network for infrared image super-resolution

Zepeng Liu 1,✉,#, Jiya Tian 1,#, Chao Liu 1, Guodong Zhang 1
PMCID: PMC12909808  PMID: 41611970

Abstract

Infrared image super-resolution (SR) remains a challenging task due to inherent limitations in existing approaches: convolutional neural network (CNN)-based methods struggle with long-range dependency modeling, whereas transformer-based methods are computationally expensive and tend to overlook fine local details. To address these issues, we propose a novel hybrid perception enhancement network (HPEN). Its core component is a hybrid perception enhancement block (HPEB), which effectively combines a token aggregation block (TAB) for global context modeling, a multi-scale feature enhancement block (MFEB) for local detail extraction, and a convolutional layer for feature refinement. Extensive experimental results demonstrate that the proposed HPEN achieves leading performance among the compared methods. For the challenging ×4 SR task, it attains the best PSNR and SSIM values among the evaluated lightweight SR approaches while demonstrating remarkable efficiency advantages. Specifically, compared to HiT-SR, HPEN reduces FLOPs by 42.9%, uses only 9.4% of the GPU memory, and delivers faster inference. The code is available at https://github.com/smilenorth1/HPEN-main.

Keywords: Infrared image super-resolution, Hybrid perception, Token aggregation, Multi-scale

Subject terms: Computational science, Computer science, Scientific data

Introduction

Infrared imaging plays a vital role in several critical fields due to its all-weather capability and ability to capture thermal information. For instance, in surveillance and security systems1,2, it enables person and vehicle detection under low-visibility conditions; in defense and aerospace3, it supports target recognition and night-vision navigation; in medical diagnostics4,5, it assists in non-invasive screening such as inflammation detection and vascular imaging; and in industrial inspection6, it helps identify overheating components or structural defects in power equipment and machinery. However, the spatial resolution of infrared sensors is inherently limited, which often restricts their practical usefulness. Infrared image super-resolution (SR) offers a promising solution to enhance image quality beyond the constraints of physical hardware. Although existing methods have made considerable progress, they still face a fundamental challenge in balancing reconstruction accuracy, particularly the preservation of fine details, with the computational efficiency required for real-time deployment.

Convolutional Neural Networks (CNNs)7–14 have become the dominant paradigm in image SR, leveraging their inherent strengths in local feature extraction and hierarchical representation learning. However, CNN-based methods7,8,10,15 are constrained by three fundamental limitations: (1) their restricted receptive fields impede the modeling of long-range dependencies; (2) static convolution kernels lack adaptability to multi-scale patterns; and (3) conventional architectures often fail to support effective cross-resolution interactions. Although some efforts have attempted to address these issues by designing deeper and wider networks (e.g., RCAN16 with over 400 layers), such expansions come at the cost of prohibitive computational complexity, which severely limits their practical deployment in efficiency-sensitive real-world scenarios.

Recent advances in vision transformer (ViT) architectures17–21 have demonstrated remarkable capabilities in image SR, primarily attributed to the global receptive field and powerful context modeling enabled by the self-attention (SA) mechanism. However, two critical limitations remain: (1) the computational complexity of self-attention leads to high memory and processing requirements17,18,22; and (2) its focus on long-range dependencies often results in inadequate representation of high-frequency local details, such as edges and textures23,24. Although subsequent studies have attempted to alleviate these issues through various strategies (e.g., clustering-based local self-attention25 and hierarchical windows26), challenges remain in balancing lightweight design with detail preservation for infrared image SR.

To address these issues, we propose a novel hybrid perception enhancement block (HPEB) that systematically combines three complementary components: a token aggregation block (TAB) for comprehensive global feature modeling, a multi-scale feature enhancement block (MFEB) for extracting local-contextual details, and a $1\times1$ convolution for cross-channel feature refinement and spatial information optimization. Specifically, the TAB module follows a three-stage process: it first groups similar image patches using learnable token centers updated via an exponential moving average; then employs a dual-branch architecture integrating intra-group self-attention (IASA) and inter-group cross-attention (IRCA) to fuse local refinements with global semantics; and finally applies a convolutional feed-forward network (ConvFFN) to further enhance local feature representation. The MFEB adopts a parallel strategy: it uses dual-branch dilated convolutions (with dilation rates of 1 and 2) to capture both fine details and broader context, dynamically fuses these multi-scale features via a $1\times1$ convolution, and subsequently strengthens global dependencies through Restormer layers (RTLs). By integrating these carefully designed modules, we construct an end-to-end trainable network named HPEN. Extensive experimental results demonstrate that the proposed HPEN achieves an optimal balance between computational efficiency and reconstruction quality, as quantitatively shown in Fig. 1.

Fig. 1.

Comprehensive comparison between the proposed method and other lightweight approaches for ×4 SR on the CVC-09 dataset. The left subplot illustrates the reconstruction quality (PSNR) against both parameter count and computational complexity (FLOPs), with circle sizes representing FLOPs magnitude. The right subplot demonstrates PSNR in relation to inference time and GPU memory consumption, where circle sizes correspond to GPU memory usage. Experimental results indicate that the proposed HPEN achieves a favorable balance between computational overhead (parameters, FLOPs, runtime, and GPU memory) and reconstruction performance (PSNR) compared to existing lightweight SR models.

The main contributions of this work can be summarized as follows:

  • We propose a novel hybrid perception enhancement block (HPEB) that effectively integrates global structural modeling with local detail preservation in a unified architecture, enabling simultaneous capture of long-range dependencies and fine-grained textures for high-quality infrared image super-resolution.

  • We develop a multi-scale feature enhancement block (MFEB) that applies dual-branch dilated convolutions to extract local-contextual information and integrates Restormer layers (RTLs) to enhance global dependencies, thereby establishing a collaborative local-global optimization framework.

  • We conduct comprehensive evaluation of the proposed HPEN model across six datasets, demonstrating its ability to achieve an optimal trade-off between model complexity and reconstruction performance.

Related works

CNN-based SR methods

CNNs have established themselves as a cornerstone of modern computer vision, owing to their exceptional ability to learn hierarchical feature representations. This dominance naturally extends to the field of image SR. The pioneering SRCNN7 first introduces an end-to-end CNN framework for single image SR. Subsequently, VDSR10 advances training efficiency by incorporating residual learning27 and optimized convergence strategies. Later innovations, such as DRCN9 and DRRN28, further improve reconstruction quality through recursive structures and enhanced residual connections27. SRGAN12 revolutionizes perceptual quality by employing adversarial training to overcome the limitations of conventional loss functions. However, constrained by the inherently local receptive field of convolution operations, these methods struggle to capture long-range dependencies and global contextual information, leading to limited representational capacity. To address this issue, researchers have progressively increased network scale and complexity. For instance, EDSR29 significantly boosts performance by expanding to 32 layers and 43 million parameters, while RCAN16 constructs an architecture exceeding 400 layers through channel attention and residual connections. Nevertheless, this pursuit of performance through greater depth and width comes at a significant cost: the high computational complexity and memory footprint of these large models severely hinder their deployment in real-world application scenarios.

To address computational complexity, numerous lightweight CNN designs8,11,13,14,30 have been developed to balance performance and efficiency through various strategies. For instance, FSRCNN8 and ESPCN15 improve efficiency by employing post-upsampling operations. CARN31 achieves fast and accurate SR using group convolution and cascaded feature integration. The concept of information distillation, introduced by IDN32 through enhancement and compression units, is further advanced by IMDN11 with multi-distillation modules that optimize the memory-time trade-off. RFDN30 further refines this concept with enhanced feature distillation links, winning the NTIRE 2022 Efficient SR Challenge. BSRN13 reduces computational redundancy via blueprint separable convolution, while ShuffleMixer33 minimizes FLOPs by employing large kernels combined with channel operations. More recent approaches incorporate dynamic feature modulation for greater flexibility. For example, SAFMN14 implements spatially adaptive modulation within a ViT-like architecture for long-range dependency modeling, and SMFANet34 leverages dual-branch processing to handle both local and non-local features. Despite these advancements, such methods remain fundamentally constrained by the use of fixed convolutional kernels, leading to edge blurring and detail loss. To overcome these limitations, we propose a hybrid perception enhancement block (HPEB), which facilitates effective global modeling and hierarchical feature interactions, thereby significantly improving reconstruction quality while maintaining high computational efficiency.

ViT-based SR methods

ViTs17 have recently shown strong potential for image SR, primarily due to their ability to model long-range dependencies and capture global context information. The pioneering work IPT18 first adapts the transformer architecture for SR tasks, establishing new performance benchmarks. However, the standard ViT design incurs very high computational costs, mainly because its global SA mechanism22 scales quadratically with input size. To mitigate this, several efficient variants have been developed. SwinIR20 employs shifted window attention to limit computation to local windows while maintaining cross-window connections, greatly reducing complexity. HAT35 further improves this approach by integrating channel attention with overlapping cross-attention, achieving state-of-the-art results. Restormer21 proposes a complementary strategy using multi-head transposed attention to efficiently capture global context. CATANet25 improves parallel processing through subgroup token balancing and incorporates residual local relation self-attention to harmonize global and local representations under linear complexity, while HiT-SR26 employs hierarchical window expansion to capture multi-scale contexts and long-range dependencies for efficient SR reconstruction.

Despite these advancements, current ViT-based SR methods still struggle to optimally balance performance and efficiency. Most either sacrifice reconstruction quality to maintain reasonable computational costs, or achieve superior performance at the expense of prohibitive complexity. This fundamental trade-off highlights the need for a more balanced approach that can maintain high reconstruction quality while ensuring practical deployability.

Infrared image SR methods

Deep learning has significantly advanced infrared image processing by demonstrating powerful capabilities in automated feature extraction and complex nonlinear mapping. Infrared image SR, in particular, has emerged as a promising application. Numerous deep learning methods have been developed specifically for this domain: TherISuRNet36 utilizes progressive upscaling with asymmetric residual learning, while PSRGAN37 adopts a dual-path architecture that leverages visible light imagery to compensate for limited IR training data. Marivani et al.38 incorporate sparse priors within a multi-modal framework, and MoCoPnet39 integrates domain knowledge to address the inherent feature scarcity in IR small target SR. For efficient deployment, LISN40 achieves lightweight reconstruction through feature correlation aggregation and channel operations, while LKFormer41 replaces self-attention with large-kernel depth-wise convolutions for non-local modeling and employs gated-pixel feed-forward networks to improve information flow. Additionally, RLKA-Net42 employs recurrent strategies and large kernel attention to achieve efficient infrared SR, and LIRSRN43 incorporates an attention enhancement module with spatial-frequency processing. Despite these advancements, existing methods still face limitations in optimally balancing reconstruction performance with computational efficiency. Most approaches either prioritize accuracy at the cost of practical deployability, or achieve efficiency while compromising the restoration of fine details that are critical for infrared applications, highlighting the need for more balanced solutions in the infrared SR domain.

In recent years, general-purpose computer vision architectures have achieved remarkable progress in tasks such as detection and classification. Object detection frameworks such as SSD44, Faster R-CNN45, and YOLOv846 have achieved strong performance in real-time processing, small object detection, and complex scene understanding by integrating lightweight designs11,30, multi-scale feature fusion14,34, and attention mechanisms13,42,47. Similarly, modern convolutional architectures like ConvNeXt48 have attained near-state-of-the-art accuracy in classification tasks, particularly in medical imaging through deep integration with attention modules. Mask R-CNN49 has also achieved significant results in instance segmentation. Beyond perceptual tasks, related research in image security has introduced methods such as a structured chromatic encryption algorithm50 that combines color mapping with mathematical transformations to protect sensitive medical multimedia data, and a reversible fragile watermarking scheme51 that embeds two bits per pixel to generate dual watermarked images, achieving high capacity and robustness while preserving visual transparency. These advances collectively reflect a common design principle: balancing performance and complexity through architectural refinement and efficient attention fusion. However, these studies primarily target high-level vision tasks such as detection and classification in the visible spectrum, and their design paradigms are not directly applicable to infrared image super-resolution, a representative low-level vision task. Infrared imagery exhibits unique characteristics, including thermal radiation properties, low contrast, and complex noise patterns, which demand specialized models capable of enhanced detail recovery and targeted degradation modeling.
Inspired by the “hybrid architecture” and “lightweight attention” principles in these advanced works, this paper focuses on addressing the specific challenges of infrared image SR. We propose the HPEN network, which incorporates a core hybrid perception enhancement block. By combining customized token aggregation with multi-scale convolutional enhancement, our approach achieves synergistic improvement of both global structures and local details in infrared images, establishing an improved balance between reconstruction accuracy and computational efficiency.

Proposed method

We propose a simple yet effective SR model that delivers accurate infrared image reconstruction by collaboratively exploiting local and non-local features. The core innovation is the hybrid perception enhancement block (HPEB), a unified architecture that systematically combines complementary feature processing mechanisms. The HPEB operates in three sequential stages: (1) a token aggregation block (TAB) to capture global contextual dependencies; (2) a multi-scale feature enhancement block (MFEB) to extract fine-grained local details; and (3) a $1\times1$ convolutional layer for feature refinement. This synergistic design allows the model to focus on information-rich regions while maintaining computational efficiency, thereby establishing a new balance between performance and complexity in infrared SR. The details of each component are elaborated below.

Figure 2 illustrates the overall architecture of the proposed HPEN network. It consists of three main components: shallow feature extraction, deep feature extraction, and an upsampler module. Specifically, a $3\times3$ convolutional layer is first applied to the low-resolution input $I_{LR}$ to extract shallow features, denoted as $F_0$. This process can be formulated as:

$F_0 = H_{SF}(I_{LR}) \qquad (1)$

where $H_{SF}(\cdot)$ denotes the shallow feature extraction operation. These shallow features are then fed into a stack of hybrid perception enhancement blocks (HPEBs). Each HPEB contains a token aggregation block (TAB), a multi-scale feature enhancement block (MFEB), and a $1\times1$ convolutional layer. This procedure can be expressed as:

$F_D = H_{DF}(F_0) \qquad (2)$

where $H_{DF}(\cdot)$ represents the deep feature extraction process carried out by the stacked HPEBs, and $F_D$ refers to the resulting deep feature representation. To facilitate gradient flow and preserve low-frequency information, we incorporate a global residual connection, which allows the network to concentrate on reconstructing high-frequency details. The upsampler module, which consists of a $3\times3$ convolution followed by a sub-pixel convolution layer15, then efficiently reconstructs the high-resolution output. This process can be represented as:

$I_{SR} = H_{UP}(F_D + F_0) \qquad (3)$

where $H_{UP}(\cdot)$ is the upsampler module that reconstructs the high-resolution image, yielding the final output $I_{SR}$. For the loss function, we adopt the $L_1$ loss, following previous work34,42 in SR tasks, which is defined as:

$\mathcal{L}_{1} = \left\| I_{SR} - I_{HR} \right\|_1 \qquad (4)$

where $I_{HR}$ represents the ground-truth high-resolution image, and $\|\cdot\|_1$ denotes the $\ell_1$-norm operator.
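The pipeline of Eqs. (1)–(3) can be sketched in PyTorch as follows. This is a simplified illustration rather than the released implementation: the HPEB internals are replaced by a plain convolutional stub, the input is treated as single-channel, and the upsampler kernel size is assumed to be 3×3.

```python
# Sketch of the HPEN data flow: shallow conv -> stacked blocks with a
# global residual -> sub-pixel upsampler. HPEB internals are stubbed out.
import torch
import torch.nn as nn

class HPEBStub(nn.Module):
    """Placeholder for a hybrid perception enhancement block."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 1))              # final 1x1 refinement
    def forward(self, x):
        return self.body(x) + x

class HPENSketch(nn.Module):
    def __init__(self, dim=36, n_blocks=8, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(1, dim, 3, padding=1)                    # Eq. (1)
        self.deep = nn.Sequential(*[HPEBStub(dim) for _ in range(n_blocks)])  # Eq. (2)
        self.upsampler = nn.Sequential(                                   # Eq. (3)
            nn.Conv2d(dim, scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
    def forward(self, lr):
        f0 = self.shallow(lr)
        fd = self.deep(f0) + f0      # global residual connection
        return self.upsampler(fd)

sr = HPENSketch()(torch.randn(1, 1, 32, 32))
print(sr.shape)  # torch.Size([1, 1, 128, 128])
```

The dim=36 / n_blocks=8 defaults mirror the baseline configuration reported in the experiments.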

Fig. 2.

Overall architecture of the proposed HPEN. The network comprises three fundamental components: an initial $3\times3$ convolutional layer for shallow feature extraction, a series of hybrid perception enhancement blocks (HPEBs) for deep feature extraction, and a final upsampling module for high-resolution reconstruction. Each HPEB represents the core building block and integrates three carefully designed sub-modules: a token aggregation block (TAB) for global dependency modeling, a multi-scale feature enhancement block (MFEB) for local detail extraction, and a $1\times1$ convolutional layer for feature refinement and integration.

Token aggregation block

In image SR tasks, conventional CNN-based methods14,29,30 are limited by the fixed receptive fields of their convolutional kernels, while transformer-based approaches18,20,35 often incur excessive computational costs and lack inherent spatial inductive bias despite their global modeling capacity. To address these issues, we introduce the token aggregation block (TAB) adapted from CATANet25, which efficiently captures feature interactions through content-aware dynamic reorganization and a hierarchical attention mechanism. As illustrated in Fig. 2(b), TAB consists of three core components: the content-aware token aggregation (CATA) module, intra-group self-attention (IASA), and inter-group cross-attention (IRCA). Given an input feature $X$, the CATA module dynamically clusters image patches using a set of learnable token centers. These centers are updated during training via an exponential moving average (EMA) strategy with a decay parameter of 0.999 to enable semantically coherent grouping. The CATA operation can be expressed as:

$Z = \mathrm{CATA}(\mathrm{LN}(X)) \qquad (5)$

where $\mathrm{LN}(\cdot)$ denotes the layer normalization operation and $Z$ represents the clustered features. The IASA module then performs fine-grained feature interactions within each token subgroup, using overlapped grouping to preserve local spatial continuity. This operation can be written as:

$Z_{\mathrm{IASA}} = \mathrm{IASA}(Z) \qquad (6)$

where $Z_{\mathrm{IASA}}$ corresponds to the output feature of the IASA module. Subsequently, the IRCA module facilitates global semantic propagation by enabling interaction across token groups and a global center token. The IRCA module is formulated as follows:

$Z_{\mathrm{IRCA}} = \mathrm{IRCA}(Z_{\mathrm{IASA}}) \qquad (7)$

where $Z_{\mathrm{IRCA}}$ denotes the output feature of the IRCA module. Finally, a convolutional feed-forward network (ConvFFN) is applied to further integrate local features and channel information, thereby enhancing the model’s nonlinear representation capacity. This step is expressed as:

$F_{\mathrm{TAB}} = \mathrm{ConvFFN}\big(\tilde{Z}\big), \quad \tilde{Z} = \mathrm{Conv}_{1\times1}(Z_{\mathrm{IRCA}}) \qquad (8)$

where $\mathrm{Conv}_{1\times1}(\cdot)$ denotes the $1\times1$ convolution, $\tilde{Z}$ represents the intermediate aggregated features, $\mathrm{ConvFFN}(\cdot)$ corresponds to the convolution-enhanced feed-forward network, and $F_{\mathrm{TAB}}$ is the final output of the TAB module.
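The EMA-based center update at the heart of CATA can be illustrated with a small, self-contained sketch (NumPy, with hypothetical function names): each token is assigned to its nearest center, and each center is then blended toward the mean of its assigned tokens. A smaller decay than the paper's 0.999 is used here so the update is visible in a single step.

```python
# Illustrative sketch of the CATA grouping idea: nearest-center
# assignment followed by an exponential-moving-average center update.
import numpy as np

def ema_cluster_step(tokens, centers, decay=0.9):
    # tokens: (N, C) flattened patch features; centers: (K, C)
    d = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) sq. distances
    assign = d.argmin(axis=1)                                      # nearest center index
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = tokens[assign == k]
        if len(members):                                           # EMA update of center k
            new_centers[k] = decay * centers[k] + (1 - decay) * members.mean(0)
    return assign, new_centers

tokens = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.0, 0.1], [4.0, 4.0]])
assign, centers = ema_cluster_step(tokens, centers)
print(assign)  # [0 0 1 1]
```

Because only a fraction (1 − decay) of each step's statistics enters the centers, the grouping drifts slowly and stays semantically stable across training iterations.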

Multi-scale feature enhancement block

While the TAB focuses on modeling global dependencies through dynamic token aggregation, high-quality image reconstruction also critically depends on fine local details. To this end, we propose a multi-scale feature enhancement block (MFEB) that explicitly enhances local feature extraction while integrating global contextual information, forming a collaborative local-global optimization framework. As shown in Fig. 2(c), given an input feature $X$, the MFEB first splits it into two parallel branches, formulated as:

$[X_1, X_2] = \mathrm{Split}(X) \qquad (9)$

where $\mathrm{Split}(\cdot)$ denotes the channel splitting operation. One branch then applies a $3\times3$ depth-wise convolution to capture fine-grained textures, while the other utilizes a $3\times3$ dilated convolution (dilation rate = 2) to model broader contextual information. This process can be formulated as:

$Y_1 = \mathrm{GELU}\big(\mathrm{DWConv}_{3\times3}(X_1)\big), \quad Y_2 = \mathrm{GELU}\big(\mathrm{DConv}_{3\times3}^{r=2}(X_2)\big) \qquad (10)$

where $\mathrm{DConv}_{3\times3}^{r=2}(\cdot)$ represents a $3\times3$ dilated convolution with a dilation rate of 2, $\mathrm{DWConv}_{3\times3}(\cdot)$ indicates a depth-wise convolution of kernel size $3\times3$, and $\mathrm{GELU}(\cdot)$ refers to the GELU activation function. The outputs of the two branches are then dynamically fused via a $1\times1$ convolution. The fusion process can be expressed as:

$Y = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(Y_1, Y_2)\big) + X \qquad (11)$

where $\mathrm{Concat}(\cdot)$ denotes the channel concatenation operation. To facilitate gradient flow across network levels, we incorporate a residual connection. Finally, two successive Restormer layers (RTLs)21 are applied to further strengthen global feature dependencies. This process can be written as:

$F_{\mathrm{MFEB}} = \mathrm{RTL}\big(\mathrm{RTL}(Y)\big) \qquad (12)$

where $\mathrm{RTL}(\cdot)$ represents the Restormer layer transformation.
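Under the kernel-size assumptions above (3×3 depth-wise branches), the split-and-fuse path of Eqs. (9)–(11) can be sketched in PyTorch as follows; the two Restormer layers of Eq. (12) are omitted here and sketched in the next subsection.

```python
# Sketch of the MFEB dual-branch split/fuse path (Eqs. 9-11).
import torch
import torch.nn as nn

class MFEBSketch(nn.Module):
    def __init__(self, dim=36):
        super().__init__()
        half = dim // 2
        # branch 1: depth-wise 3x3 conv for fine textures (dilation 1)
        self.local = nn.Conv2d(half, half, 3, padding=1, groups=half)
        # branch 2: depth-wise 3x3 dilated conv (rate 2) for broader context
        self.context = nn.Conv2d(half, half, 3, padding=2, dilation=2, groups=half)
        self.fuse = nn.Conv2d(dim, dim, 1)   # dynamic 1x1 fusion
        self.act = nn.GELU()
    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)            # Eq. (9): channel split
        y1 = self.act(self.local(x1))                # Eq. (10): fine branch
        y2 = self.act(self.context(x2))              # Eq. (10): context branch
        y = self.fuse(torch.cat([y1, y2], dim=1))    # Eq. (11): concat + 1x1 fuse
        return y + x                                 # residual connection

out = MFEBSketch()(torch.randn(1, 36, 24, 24))
print(out.shape)  # torch.Size([1, 36, 24, 24])
```

With padding matched to the dilation rate, both branches preserve spatial resolution, so the channel-wise concatenation and residual addition line up without cropping.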

Restormer layer

To enhance long-range dependency modeling and feature representation, we incorporate the RTL21. This module combines multi-dconv head transposed attention (MDTA) and a gated-dconv feed-forward network (GDFN) to improve global context modeling and nonlinear transformation of fused features. As illustrated in Fig. 2(d), given an input $X$, the MDTA module aggregates both local and non-local pixel interactions, formulated as:

$X_{\mathrm{MDTA}} = \mathrm{MDTA}\big(\mathrm{LN}(X)\big) + X \qquad (13)$

where $X_{\mathrm{MDTA}}$ denotes the output of the MDTA module. The GDFN then dynamically modulates feature transformation by suppressing less informative components and selectively propagating relevant information through the network. The operation can be formulated as:

$X_{\mathrm{GDFN}} = \mathrm{GDFN}\big(\mathrm{LN}(X_{\mathrm{MDTA}})\big) + X_{\mathrm{MDTA}} \qquad (14)$

where $X_{\mathrm{GDFN}}$ represents the output feature of the GDFN module, and $\mathrm{LN}(\cdot)$ is defined as before.
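The efficiency of the RTL comes from computing attention across channels rather than spatial positions, yielding a C×C attention map instead of a quadratic (HW)×(HW) one. A minimal single-head sketch of this transposed attention, omitting MDTA's depth-wise convolutions and multi-head splitting, might look like:

```python
# Sketch of channel-wise ("transposed") attention: the attention map is
# (C x C), so cost grows linearly with the number of pixels HW.
import torch
import torch.nn.functional as F

def transposed_attention(x, temperature=1.0):
    b, c, h, w = x.shape
    q = k = v = x.reshape(b, c, h * w)              # (B, C, HW)
    q = F.normalize(q, dim=-1)                      # normalize along pixels
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature  # (B, C, C) channel map
    attn = attn.softmax(dim=-1)
    out = attn @ v                                  # (B, C, HW)
    return out.reshape(b, c, h, w)

y = transposed_attention(torch.randn(2, 36, 16, 16))
print(y.shape)  # torch.Size([2, 36, 16, 16])
```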

Hybrid perception enhancement block

We construct the HPEB by integrating the TAB, MFEB, and a $1\times1$ convolutional layer in sequence. This three-stage cascaded design enables progressive feature refinement. Specifically, the TAB first performs content-aware global feature reorganization via dynamic token aggregation. Given an input $X$, this step is expressed as:

$F_{\mathrm{TAB}} = \mathrm{TAB}(X) \qquad (15)$

where $F_{\mathrm{TAB}}$ represents the output of the TAB module. The MFEB then enhances local-contextual representations using multi-scale convolutions (dilation rates of 1 and 2), formulated as:

$F_{\mathrm{MFEB}} = \mathrm{MFEB}(F_{\mathrm{TAB}}) \qquad (16)$

where $F_{\mathrm{MFEB}}$ represents the output of the MFEB module. Finally, a $1\times1$ convolution preserves multi-granularity details, preventing high-frequency information from being over-smoothed by the preceding transformer operations, expressed as:

$F_{\mathrm{HPEB}} = \mathrm{Conv}_{1\times1}(F_{\mathrm{MFEB}}) \qquad (17)$

where $\mathrm{Conv}_{1\times1}(\cdot)$ denotes a $1\times1$ convolution operation, and $F_{\mathrm{HPEB}}$ represents the final output of the HPEB.

Experimental results

Datasets and implementation details

Datasets. We use the IR700 dataset52 as the source of high-resolution (HR) images. This dataset comprises 700 infrared images encompassing diverse scenarios, including urban landscapes, vegetation, pedestrians, vehicles, and low-visibility conditions at night. Its authentic noise distribution and atmospheric degradation characteristics help mitigate the domain shift problem common in synthetically generated data. The dataset is partitioned into training, validation, and test sets in an 8:1:1 ratio, resulting in 560, 70, and 70 images, respectively. Low-resolution (LR) images are synthesized by applying bicubic downsampling to the HR images. For evaluation, we utilize six test datasets: results-A53, results-C54, CVC09-10053, IR700-test52, M3FD55, and my_test. We quantitatively evaluate the reconstruction quality using two standard metrics: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), both computed on the luminance (Y) channel.
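For reference, PSNR on the luminance channel can be computed as in the following sketch (NumPy); the BT.601 luma weights used here are the common convention in SR benchmarks, and SSIM is omitted for brevity.

```python
# PSNR on the Y (luminance) channel of images in [0, 1].
import numpy as np

def rgb_to_y(img):
    # img: float array in [0, 1], shape (H, W, 3); BT.601 luma weights
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def psnr_y(sr, hr, eps=1e-12):
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10 * np.log10(1.0 / (mse + eps))   # peak value is 1.0

hr = np.ones((8, 8, 3)) * 0.5
sr = hr + 0.01                # uniform 0.01 error on every channel
print(round(psnr_y(sr, hr), 2))  # 40.0
```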

Implementation details. The model is trained with the following configuration. LR input patches are randomly cropped and augmented via random horizontal flipping and rotation. We train the network for 100,000 iterations with a batch size of 16, using the Adam optimizer56 and a MultiStepLR57 learning rate scheduler with a decaying learning rate. The network employs 8 HPEBs with 36 feature channels. All experiments are conducted on a Linux server using PyTorch 1.13.0 with CUDA 11.8, an NVIDIA GeForce RTX 3090 GPU (24GB VRAM), an Intel Xeon E5-2620 v4 CPU, and 32GB of system memory.
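A minimal training-setup sketch under this configuration is given below (PyTorch). The learning-rate value, Adam betas, milestone schedule, and 64×64 patch size are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the training configuration: Adam + MultiStepLR + L1 loss.
# Hyperparameter values below are illustrative assumptions.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(1, 1, 3, padding=1)          # stand-in for HPEN
optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
scheduler = MultiStepLR(optimizer, milestones=[50_000, 80_000], gamma=0.5)
criterion = torch.nn.L1Loss()                        # the L1 loss of Eq. (4)

for it in range(3):                                  # abbreviated loop (100k iters in practice)
    lr_img = torch.randn(16, 1, 64, 64)              # batch size 16, assumed 64x64 patches
    hr_img = torch.randn(16, 1, 64, 64)
    loss = criterion(model(lr_img), hr_img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # step once per iteration

print(optimizer.param_groups[0]['lr'])  # 0.0005 (no milestone reached yet)
```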

Comparisons with state-of-the-art methods

Quantitative comparison. The proposed method is compared with state-of-the-art lightweight SR approaches, including CARN31, IMDN11, RFDN30, BSRN13, SMSR58, ShuffleMixer33, HNCT59, PSRGAN37, SAFMN14, RLKA-Net42, SMFANet34, SCNet60, LIRSRN43, CATANet25, and HiT-SR26. Quantitative results for the ×2 and ×4 magnification factors are summarized in Table 2. In addition to the standard PSNR and SSIM metrics, we also report the number of parameters (#Params) and computational complexity (#FLOPs). The FLOPs are uniformly measured using the fvcore library (i.e., fvcore.nn.flop_count_str) at a fixed high-resolution output size.

Table 2.

A comprehensive comparison with lightweight SR methods on six test datasets is presented. All reported PSNR and SSIM values are calculated on the luminance (Y) channel in the transformed color space. Computational complexity (#FLOPs) is evaluated using high-resolution images at a fixed output size. The top two results are highlighted in italic and bold italic font, respectively, for clear distinction.

Scale Model Params(K) FLOPs(G) resultsA resultsC CVC-09 IR700-test M3FD my_test
×2 CARN31 1592 297 39.75/0.9489 40.75/0.9589 44.83/0.9737 38.14/0.9435 40.20/0.9676 37.08/0.9157
IMDN11 715 218 39.77/0.9491 40.76/0.9589 44.87/0.9740 38.19/0.9438 40.28/0.9677 37.09/0.9159
RFDN30 417 122 39.76/0.9491 40.77/0.9590 44.88/0.9739 38.18/0.9437 40.13/0.9670 37.11/0.9161
BSRN13 332 96 39.77/0.9495 40.78/0.9593 44.85/0.9739 38.22/0.9440 40.24/0.9673 37.10/0.9160
SMSR58 987 468 39.76/0.9491 40.76/0.9590 44.84/0.9737 38.14/0.9435 40.20/0.9669 37.08/0.9158
ShuffleMixer_tiny33 247 76 39.74/0.9490 40.76/0.9589 44.85/0.9738 38.17/0.9437 40.16/0.9673 37.08/0.9158
HNCT59 357 111 39.78/0.9492 40.78/0.9591 44.88/0.9739 38.21/0.9440 40.28/0.9679 37.11/0.9159
PSRGAN37 313 - 39.76/0.9490 40.71/0.9589 44.79/0.9731 38.16/0.9434 38.24/0.9524 37.09/0.9158
SAFMN14 228 69 39.75/0.9491 40.64/0.9589 44.86/0.9739 38.13/0.9435 40.29/0.9679 37.08/0.9158
RLKA-Net42 225 253 39.79/0.9493 40.78/0.9590 44.83/0.9737 38.19/0.9438 40.05/0.9657 37.09/0.9158
SMFANet34 186 52 39.77/0.9493 40.78/0.9592 44.85/0.9738 38.11/0.9434 40.28/0.9677 37.06/0.9157
SCNet60 146 56 39.74/0.9489 40.74/0.9588 44.82/0.9737 38.08/0.9432 40.29/0.9679 37.06/0.9157
LIRSRN43 49 14 39.40/0.9475 40.11/0.9573 43.98/0.9730 37.03/0.9364 39.46/0.9658 36.27/0.9087
CATANet25 477 197 39.80/0.9493 40.79/0.9590 44.89/0.9738 38.26/0.9441 40.34/0.9684 37.09/0.9159
HiT-SR26 847 299 39.81/0.9496 40.81/0.9592 44.90/0.9741 38.27/0.9440 40.36/0.9683 37.12/0.9160
HPEN(Ours) 434 173 39.83/0.9497 40.82/0.9594 44.92/0.9740 38.28/0.9442 40.37/0.9684 37.13/0.9162
×4 CARN31 1592 121 34.77/0.8615 35.43/0.8783 40.54/0.9513 31.87/0.8447 32.93/0.8666 34.22/0.8752
IMDN11 715 55 34.83/0.8626 35.51/0.8791 40.73/0.9521 31.93/0.8450 33.07/0.8679 34.27/0.8757
RFDN30 433 32 34.86/0.8630 35.50/0.8793 40.75/0.9523 32.01/0.8462 33.09/0.8679 34.26/0.8757
BSRN13 352 26 34.90/0.8636 35.56/0.8799 40.77/0.9523 32.07/0.8472 33.14/0.8702 34.29/0.8761
SMSR58 1008 119 34.88/0.8631 35.52/0.8792 40.74/0.9521 32.08/0.8465 33.09/0.8700 34.26/0.8754
ShuffleMixer_tiny33 251 21 34.79/0.8621 35.49/0.8788 40.60/0.9516 31.92/0.8447 32.99/0.8673 34.23/0.8753
HNCT59 373 29 34.92/0.8638 35.58/0.8802 40.84/0.9527 32.12/0.8480 33.15/0.8699 34.30/0.8761
PSRGAN37 350 - 34.81/0.8616 35.47/0.8790 40.67/0.9516 32.04/0.8459 31.70/0.8379 34.25/0.8754
SAFMN14 240 18 34.86/0.8628 35.51/0.8791 40.72/0.9521 32.08/0.8464 33.10/0.8690 34.28/0.8758
RLKA-Net42 245 65 34.89/0.8637 35.51/0.8799 40.79/0.9524 32.11/0.8474 33.14/0.8691 34.28/0.8758
SMFANet34 197 14 34.86/0.8630 35.51/0.8791 40.75/0.9523 32.10/0.8469 33.16/0.8699 34.29/0.8760
SCNet60 154 28 34.77/0.8609 35.44/0.8778 40.54/0.9512 32.04/0.8455 33.12/0.8692 34.25/0.8753
LIRSRN43 70 5 34.20/0.8574 34.73/0.8734 39.01/0.9454 31.18/0.8267 32.41/0.8592 33.35/0.8706
CATANet25 535 64 34.91/0.8637 35.55/0.8800 40.83/0.9525 32.13/0.8483 33.20/0.8711 34.33/0.8765
HiT-SR26 866 77 34.91/0.8638 35.56/0.8801 40.86/0.9527 32.14/0.8485 33.22/0.8711 34.30/0.8763
HPEN(Ours) 445 44 34.93/0.8640 35.58/0.8803 40.88/0.9528 32.16/0.8487 33.25/0.8713 34.34/0.8768

We first conduct a systematic evaluation of the model’s channel dimension (“Dim”) and the number of HPEB modules (“Blocks”) to assess how different configurations affect performance, resource consumption, and efficiency. As shown in Table 1, the configuration with Dim=36 and Blocks=8 is selected as the baseline model, as it offers a favorable trade-off between efficiency and accuracy. Compared to a model of the same depth but wider channels (Dim=48, Blocks=8), this baseline reduces parameters and FLOPs by approximately 38.0% and 38.6%, respectively, while decreasing GPU memory usage by about 13.8%. When compared to the largest high-performance configuration (Dim=48, Blocks=12), the baseline achieves substantial savings of roughly 58.3% in parameters and 58.8% in FLOPs, while incurring an average performance drop of less than 0.5% across the three datasets. This configuration therefore balances resource efficiency and reconstruction accuracy effectively, justifying its selection as the baseline model.

Table 1.

Performance comparison of the proposed model across various channel configurations and module counts, with the baseline HPEN configuration highlighted in bold.

Ablation Dim Blocks #Params [K] #FLOPs [G] #GPU Mem. [M] #Avg.Time [ms] resultsA resultsC M3FD
HPEN 36 8 445 44.1 115.98 44.63 34.93/0.8640 35.58/0.8803 33.25/0.8713
36 10 553 54.8 117.79 47.93 34.98/0.8641 35.60/0.8804 33.27/0.8714
36 12 660 65.5 120.21 50.04 35.02/0.8644 35.63/0.8806 33.29/0.8713
48 8 718 71.8 134.62 48.96 35.00/0.8645 35.65/0.8807 33.31/0.8716
48 10 892 89.4 138.04 54.32 35.08/0.8649 35.67/0.8810 33.35/0.8718
48 12 1066 107.0 141.73 59.06 35.09/0.8648 35.70/0.8811 33.34/0.8717
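The efficiency ratios quoted for Table 1 can be re-derived directly from its rows. The following is a quick standard-library check (all numbers copied from the table; no values are assumed beyond it):

```python
# Cross-check of the efficiency ratios quoted for Table 1.
# (params in K, FLOPs in G, GPU memory in M; values copied from the table)
baseline = {"params": 445, "flops": 44.1, "mem": 115.98}    # Dim=36, Blocks=8
wide     = {"params": 718, "flops": 71.8, "mem": 134.62}    # Dim=48, Blocks=8
largest  = {"params": 1066, "flops": 107.0, "mem": 141.73}  # Dim=48, Blocks=12

def saving(small, big):
    """Relative reduction of the baseline w.r.t. a larger configuration."""
    return (big - small) / big

print(f"vs Dim=48/Blocks=8:  params -{saving(baseline['params'], wide['params']):.1%}, "
      f"FLOPs -{saving(baseline['flops'], wide['flops']):.1%}, "
      f"memory -{saving(baseline['mem'], wide['mem']):.1%}")
print(f"vs Dim=48/Blocks=12: params -{saving(baseline['params'], largest['params']):.1%}, "
      f"FLOPs -{saving(baseline['flops'], largest['flops']):.1%}")
```

The printed reductions (38.0%/38.6%/13.8% against Dim=48, Blocks=8; 58.3%/58.8% against Dim=48, Blocks=12) match the percentages reported in the text.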

As shown in Table 2, the proposed HPEN exhibits competitive performance in both reconstruction quality and computational efficiency. For the ×2 SR task, HPEN reaches a PSNR of 38.28dB on the IR700-test dataset and 39.83dB on the resultsA dataset, outperforming the strong baseline HiT-SR26 by 0.03% and 0.05%, respectively. Its advantage is more pronounced in the more challenging ×4 task, where HPEN attains 40.88dB on the CVC-09 dataset and 34.34dB on the my_test dataset, outperforming HiT-SR26 and CATANet25 by 0.05% and 0.03%, respectively. Notably, these results are achieved at significantly lower computational cost: in the ×4 setting, HPEN requires only 44G FLOPs, compared with 77G for HiT-SR26 and 64G for CATANet25, while still achieving a 0.06% higher PSNR on the IR700-test dataset. Although LIRSRN43 maintains an extremely low parameter count and computational complexity (70K parameters and 5G FLOPs), it suffers an average degradation of 1.04dB in PSNR and 0.0102 in SSIM relative to HPEN, with the PSNR reduction reaching 4.6% on the CVC-09 dataset. Table 2 also shows that, excluding LIRSRN43, PSNR variations among the evaluated methods are confined within 0.35dB, with over 75% of the differences below 0.10dB, which makes the 1.04dB gap particularly notable.
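The average quality gap between HPEN and LIRSRN43 quoted above follows directly from the ×4 rows of Table 2. The following is a standard-library recomputation using only values copied from the table:

```python
# Recomputing the average PSNR/SSIM gap between HPEN and LIRSRN on the
# six ×4 test sets (values copied from the results table, in table order:
# resultsA, resultsC, CVC-09, IR700-test, M3FD, my_test).
hpen   = [(34.93, 0.8640), (35.58, 0.8803), (40.88, 0.9528),
          (32.16, 0.8487), (33.25, 0.8713), (34.34, 0.8768)]
lirsrn = [(34.20, 0.8574), (34.73, 0.8734), (39.01, 0.9454),
          (31.18, 0.8267), (32.41, 0.8592), (33.35, 0.8706)]

psnr_gap = sum(h[0] - l[0] for h, l in zip(hpen, lirsrn)) / len(hpen)
ssim_gap = sum(h[1] - l[1] for h, l in zip(hpen, lirsrn)) / len(hpen)
cvc_rel  = (40.88 - 39.01) / 40.88   # relative PSNR reduction on CVC-09

print(f"avg PSNR gap: {psnr_gap:.2f} dB, avg SSIM gap: {ssim_gap:.4f}, "
      f"CVC-09 relative drop: {cvc_rel:.1%}")
```

The recomputed averages (1.04dB PSNR, 0.0102 SSIM) and the 4.6% relative drop on CVC-09 agree with the figures reported in the text.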

Additionally, we compare the GPU memory usage and inference time of the proposed method against mainstream methods, including RFDN30, ShuffleMixer33, RLKA-Net42, HNCT59, CATANet25 and HiT-SR26. As illustrated in Table 3, the proposed HPEN is highly efficient in terms of memory consumption, requiring only 115.98M, a reduction of approximately 90.58% and 83.54% compared with the most memory-intensive models, HiT-SR26 (1231.27M) and CATANet25 (704.38M), respectively. In terms of inference time, HPEN takes only 44.63ms, about 61.0% faster than CATANet (114.49ms) and 55.84% faster than HiT-SR (101.06ms). Although it ranks as the third fastest method behind RFDN and ShuffleMixer, its inference time remains highly competitive. Together with its leading PSNR/SSIM performance (Table 2), these results demonstrate that HPEN achieves a more favorable performance-efficiency trade-off than existing approaches. A visual summary of GPU memory and inference time across all methods is provided in Figure 1.

Table 3.

A comparison of computational efficiency for different ×4 SR methods. The analysis compares average inference time (#Avg.Time) and GPU memory consumption (#GPU Mem.) across 50 test images with Inline graphic pixel resolution. All measurements were conducted under consistent hardware and software configurations to ensure comparative fairness.

Methods RFDN30 ShuffleMixer33 RLKA-Net42 HNCT59 CATANet25 HiT-SR26 HPEN(Ours)
#GPU Mem. [M] 59.97 201.39 46.12 116.04 704.38 1231.27 115.98
#Avg.Time [ms] 15.11 19.58 53.65 59.14 114.49 101.06 44.63
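The memory and latency comparisons above can be sanity-checked against the raw Table 3 entries; the snippet below uses only values copied from the table:

```python
# Sanity-check of the memory and latency comparisons drawn from Table 3.
mem  = {"HPEN": 115.98, "CATANet": 704.38, "HiT-SR": 1231.27}   # MB
time = {"HPEN": 44.63, "CATANet": 114.49, "HiT-SR": 101.06}     # ms

for rival in ("HiT-SR", "CATANet"):
    mem_saving = 1 - mem["HPEN"] / mem[rival]   # fraction of memory saved
    speedup    = time[rival] / time["HPEN"]     # wall-clock speed ratio
    print(f"vs {rival}: memory -{mem_saving:.2%}, {speedup:.2f}x faster")
```

Against HiT-SR26 this yields a 90.58% memory saving and a roughly 2.26× speedup, consistent with the ~2.3× figure quoted in the abstract; against CATANet25 the memory saving comes out to about 83.5%.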

To further assess the perceptual quality of the reconstructed images, we evaluate all approaches using the learned perceptual image patch similarity (LPIPS)61 metric with a unified, pre-trained AlexNet62 backbone to ensure a fair comparison. As reported in Table 4, the proposed HPEN achieves the lowest LPIPS scores across all test sets, indicating superior perceptual fidelity. Notably, on the CVC-09 dataset, HPEN attains the best score of 0.0296, slightly ahead of HiT-SR26 (0.0299) and clearly ahead of the remaining lightweight approaches. The stacked perceptual-loss comparison in Fig. 3 further confirms that HPEN consistently yields the smallest perceptual deviation. These results demonstrate that HPEN better preserves perceptually important details in infrared image SR, generating outputs that are perceptually closer to the original HR images.

Table 4.

A comparison of LPIPS values across different methods based on the same pre-trained network for ×4 SR results.

Scale SR results Model resultsA resultsC CVC-09 IR700-test M3FD my_test
×4 CARN31 0.1252 0.1187 0.0613 0.1764 0.1509 0.1304
ShuffleMixer_tiny33 0.1142 0.1114 0.0486 0.1625 0.1420 0.1246
SAFMN14 0.1114 0.1109 0.0401 0.1594 0.1389 0.1214
LIRSRN43 0.1235 0.1242 0.0406 0.1691 0.1482 0.1294
HiT-SR26 0.1016 0.1040 0.0299 0.1552 0.1338 0.1176
HPEN(Ours) 0.1015 0.1036 0.0296 0.1547 0.1331 0.1168

Fig. 3.

Fig. 3

Comprehensive comparison of LPIPS values between the proposed method and other lightweight approaches for ×4 SR results across all test datasets.

Qualitative comparisons. Figure 4 presents a visual comparison of ×4 SR results across multiple lightweight methods. The proposed HPEN excels in both detail restoration and structural preservation. Notably, it effectively mitigates the over-smoothing artifacts commonly observed in the outputs of IMDN11, BSRN13, ShuffleMixer33, and SAFMN14, preserving sharper edge contours and more intricate textures. Furthermore, HPEN successfully recovers structural details that are inadequately reconstructed by HNCT59 and SMFANet34, which is particularly evident in the restoration of stripe patterns.

Fig. 4.

Fig. 4

Visual comparisons of different methods on image001 from the resultsA dataset for ×4 SR. The proposed HPEN reconstructs sharper edges and more authentic textures than other approaches.

LAM comparisons. Figure 5 compares the local attribution maps (LAMs)63 and corresponding diffusion indices (DIs)63 for image014 from the resultsC dataset. The proposed HPEN shows notably broader activation coverage and achieves a significantly higher DI of 3.25. Its extensive high-response regions (shown in red) visually reflect a stronger ability to integrate contextual information. This contrasts sharply with the more limited attention range of HiT-SR26 (DI=1.35) and with the restricted information utilization of the other visualized models, all of which attain DI values no greater than 1.00.

Fig. 5.

Fig. 5

A comparison of the local attribution maps (LAMs)63 and corresponding diffusion indices (DIs)63 across different methods. The LAM visualizes each pixel’s contribution from the LR input to the final SR output within the marked region, while the DI quantifies the spatial dispersion of influential pixels, with higher values indicating a more widespread attention distribution. The LAM visualizations and corresponding DI results substantiate that the proposed HPEN effectively utilizes extensive contextual information during the reconstruction process.

Ablation study

To validate the architectural design of our model, we conduct systematic ablation studies by progressively removing its key components. This allows us to assess the necessity of each part and quantify its contribution to overall performance. The quantitative results on six benchmark datasets, presented in Table 5, clearly illustrate the role each module plays in achieving the reported performance gains.

Table 5.

Comparison of ×4 SR performance results for HPEN and its architectural variants across the six test datasets. The FLOPs are measured using input images with Inline graphic pixel resolution. “A → B” stands for replacing A with B, and “A → None” represents removing operation A. “STL” denotes the Swin Transformer layer. The baseline HPEN configuration is highlighted in bold.

Ablation Variant Params[K] FLOPs[G] resultsA resultsC CVC-09 IR700-test M3FD my_test
Baseline HPEN 445 44 34.93/0.8640 35.58/0.8803 40.88/0.9528 32.16/0.8487 33.25/0.8713 34.34/0.8768
HPEB TAB → None 295 32 -0.04/-0.0005 -0.02/-0.0004 -0.09/-0.0005 -0.02/-0.0009 -0.09/-0.0007 -0.05/-0.0007
MFEB → None 260 20 -0.04/-0.0006 -0.03/-0.0006 -0.08/-0.0004 -0.03/-0.0010 -0.06/-0.0004 -0.02/-0.0004
Conv → None 352 37 -0.03/-0.0004 -0.02/-0.0004 -0.04/-0.0002 -0.01/-0.0001 -0.05/-0.0003 -0.03/-0.0003
HPEB → MFEB then TAB 445 44 -0.03/-0.0004 -0.02/-0.0004 -0.08/-0.0002 -0.01/= -0.03/-0.0002 -0.02/-0.0004
HPEB → TAB then TAB 410 33 -0.02/-0.0005 -0.11/-0.0005 -0.06/-0.0002 -0.01/-0.0006 -0.05/-0.0003 -0.01/-0.0004
HPEB → MFEB then MFEB 481 56 -0.04/-0.0006 -0.03/-0.0006 -0.10/-0.0004 -0.04/-0.0012 -0.08/-0.0005 -0.04/-0.0006
TAB LN → None 445 44 -0.15/-0.0019 -0.13/-0.0021 -0.31/-0.0024 -0.08/-0.0018 -0.12/-0.0009 -0.14/-0.0014
ConvFFN → Channel MLP 421 37 -0.03/-0.0005 -0.04/-0.0006 -0.07/= -0.04/-0.0005 -0.03/-0.0001 -0.02/-0.0004
ConvFFN → FFN 437 40 -0.01/-0.0003 -0.02/-0.0003 -0.05/-0.0003 -0.03/-0.0002 -0.02/= -0.02/-0.0003
MFEB DWConv → None 445 44 -0.05/-0.0007 -0.02/-0.0005 -0.10/-0.0004 -0.01/-0.0009 -0.05/-0.0003 -0.03/-0.0005
DilaConv → None 445 44 -0.03/-0.0005 -0.02/-0.0005 -0.09/-0.0004 -0.01/-0.0009 -0.06/-0.0004 -0.02/-0.0006
DWConv then Conv 495 51 -0.05/-0.0008 -0.03/-0.0007 -0.12/-0.0005 -0.03/-0.0011 -0.07/-0.0003 -0.03/-0.0006
DWConv then DWConv 401 41 -0.04/-0.0008 -0.03/-0.0008 -0.10/-0.0005 -0.02/-0.0013 -0.04/-0.0004 -0.03/-0.0007
DWConv then DilaConv 423 42 -0.06/-0.0007 -0.03/-0.0008 -0.10/-0.0004 -0.04/-0.0012 -0.06/-0.0004 -0.03/-0.0005
DilaConv then DWConv 423 42 -0.04/-0.0008 -0.03/-0.0008 -0.10/-0.0005 -0.02/-0.0013 -0.04/-0.0007 -0.03/-0.0007
RTL → None 318 25 -0.03/-0.0005 -0.03/-0.0005 -0.10/-0.0004 -0.03/-0.0011 -0.09/-0.0006 -0.03/-0.0004
RTL → STL 564 52 -0.02/-0.0002 -0.01/-0.0001 -0.06/-0.0003 -0.02/-0.0006 -0.03/-0.0001 -0.02/-0.0001

Effectiveness of hybrid perception enhancement block. To validate the effectiveness of, and the synergy among, the components of the HPEB module, we conduct comprehensive ablation studies. As summarized in Table 5, the full HPEN delivers the best performance in the ×4 SR task, attaining a PSNR of 32.16dB on the IR700-test dataset. Removing the TAB module (“TAB → None”) reduces the parameter count by 33.71% (to 295K) but leads to notable performance degradation across all datasets; for instance, PSNR drops by 0.09dB (0.22%) on the CVC-09 dataset. Ablating either of the remaining components (“MFEB → None” and “Conv → None”) also harms performance. Specifically, removing the MFEB module causes an average PSNR decrease of 0.04dB.

Notably, even when the overall model capacity remains unchanged, altering the processing order to MFEB followed by TAB (“HPEB → MFEB then TAB”) substantially degrades performance. As shown in Table 5, this change reduces PSNR by 0.08dB (0.20%) on the CVC-09 dataset. Replacing HPEB with a dual-TAB structure (“HPEB → TAB then TAB”) saves 7.19% in parameters, but at the cost of a 0.11dB (0.31%) lower PSNR on the resultsC dataset, underscoring the essential role of the MFEB’s multi-scale feature extraction in preserving infrared image structures. Conversely, a dual-MFEB configuration (“HPEB → MFEB then MFEB”) increases parameters by 8.09% without improving any metric; instead, it reduces PSNR on the CVC-09 dataset by 0.10dB. Together, these results validate the importance of the complementary TAB-MFEB design in balancing model efficiency and reconstruction performance for infrared image SR.
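The parameter-change percentages cited for these HPEB ablations follow directly from the “#Params” column of Table 5 (baseline: 445K). A short stdlib check, using only values copied from the table:

```python
# Parameter changes of HPEB ablation variants relative to the 445K baseline
# (values copied from the "#Params" column of Table 5).
BASE = 445  # K parameters

variants = {
    "TAB -> None":    295,  # global branch removed
    "MFEB then MFEB": 481,  # dual-MFEB configuration
}

for name, params in variants.items():
    change = (params - BASE) / BASE
    print(f"{name}: {change:+.2%} parameters")

# Relative PSNR drop on CVC-09 when TAB is removed (0.09 dB off 40.88 dB).
print(f"CVC-09 drop for TAB removal: {0.09 / 40.88:.2%}")
```

This reproduces the quoted -33.71% parameter reduction for TAB removal, the +8.09% increase for the dual-MFEB variant, and the 0.22% relative PSNR drop on CVC-09.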

Token aggregation block. As shown in Table 5, the ablation study confirms the clear superiority of the complete HPEN (Baseline) over all architectural variants. For example, removing LayerNorm (“LN → None”) causes a marked decline in both PSNR and SSIM across datasets (e.g., resultsA drops from 34.93/0.8640 to 34.78/0.8621) without reducing parameters or computation, underscoring its essential role in training stability and representational capacity. Replacing ConvFFN with a channel MLP or a standard FFN reduces parameters by 5.4% and 1.8%, respectively, yet causes notable performance drops (e.g., 0.07dB and 0.05dB lower PSNR on the CVC-09 dataset). These comparisons show that the complete TAB design, which integrates LayerNorm, dynamic token aggregation, and the convolutional feed-forward network (ConvFFN) into a synergistic unit, is essential for achieving optimal reconstruction performance at high efficiency (445K parameters, 44G FLOPs).

Effectiveness of multi-scale feature enhancement block. To validate the effectiveness of the MFEB, we perform systematic ablation experiments. As reported in Table 5, removing the MFEB module (“MFEB → None”) results in an average PSNR degradation of 0.04dB across all test datasets. Furthermore, to examine the influence of different receptive fields, we evaluate several convolutional combinations. Table 5 shows that keeping only the depth-wise convolution (“DilaConv → None”) or only the dilated convolution (“DWConv → None”) leads to an average PSNR decrease of 0.03dB and 0.04dB across all test datasets, respectively. These results confirm that relying on a single type of feature extraction substantially degrades model performance.
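The complementary receptive fields of the two branches can be made concrete with the standard formula for dilated convolutions: a k×k kernel with dilation d spans k + (k-1)(d-1) input positions per axis. The kernel sizes below are illustrative assumptions, not specifications taken from the paper:

```python
# Effective receptive field (per axis) of a dilated convolution:
# a k x k kernel with dilation d covers k + (k - 1) * (d - 1) positions.
# The 3x3 kernels here are illustrative assumptions, not from the paper.
def receptive_field(kernel: int, dilation: int = 1) -> int:
    return kernel + (kernel - 1) * (dilation - 1)

print(receptive_field(3, 1))  # plain / depth-wise 3x3 -> 3
print(receptive_field(3, 2))  # dilated 3x3, d=2      -> 5
print(receptive_field(3, 3))  # dilated 3x3, d=3      -> 7
```

This illustrates why pairing a depth-wise branch (small receptive field, fine detail) with a dilated branch (enlarged receptive field at the same parameter cost) yields multi-scale coverage, consistent with the ablation finding that removing either branch hurts performance.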

To further assess the efficiency of the MFEB’s dual-branch architecture, we conduct extensive comparative experiments. The results show that the DWConv-Conv cascade (“DWConv then Conv”) increases the parameter count by 11.24% yet consistently degrades performance across all datasets, reducing PSNR by 0.05dB (0.14%) on the resultsA dataset. Other cascade configurations deteriorate similarly; in particular, the DilaConv-DWConv structure (“DilaConv then DWConv”) causes an average PSNR drop of 0.05dB across all evaluation datasets. These findings confirm that the carefully designed parallel branches of the MFEB extract features more efficiently than sequential alternatives.

Furthermore, we introduce the RTL module, which enables cross-region feature interaction through spatial attention, thereby effectively modeling long-range dependencies in infrared images. To evaluate its contribution, we perform an ablation study that removes the RTL module (“RTL → None”). This modification reduces model parameters by 28.5% relative to the baseline HPEN, but at the cost of performance: the average PSNR and SSIM across the test datasets decrease by 0.044dB and 0.0006, respectively. Moreover, although replacing RTL with the STL (“RTL → STL”) increases both the parameter count and the computational load, its performance remains almost at the same level as “RTL → None”, indicating that the STL cannot effectively substitute for the lightweight RTL module in this architecture. In contrast, the Baseline, with only 44G FLOPs and 445K parameters, still achieves a PSNR of 40.88dB and an SSIM of 0.9528 on the CVC-09 dataset. These results indicate that, although the RTL module increases model complexity, it is essential for maintaining high reconstruction quality in infrared SR tasks.

Conclusion

In this paper, we propose a lightweight hybrid perception enhancement network (HPEN) for infrared image super-resolution, whose core innovation is a novel hybrid perception enhancement block (HPEB). The HPEB systematically integrates three complementary components: a token aggregation block (TAB) for modeling long-range dependencies through dynamic feature reorganization, a multi-scale feature enhancement block (MFEB) for capturing fine-grained details via parallel depth-wise and dilated convolutions, and a convolutional layer for feature refinement. This synergistic design enables comprehensive feature learning while maintaining computational efficiency. Extensive experiments show that HPEN achieves state-of-the-art performance among lightweight SR methods, effectively balancing reconstruction quality with practical computational demands. This compelling performance-efficiency trade-off makes HPEN suitable for real-world applications where both accuracy and resource constraints are critical, such as enhancing low-resolution infrared footage in surveillance systems, recovering fine thermal textures for industrial inspection and predictive maintenance, and assisting higher-resolution analysis in medical thermography.

While the proposed HPEN achieves strong performance in infrared image SR, it is important to acknowledge its current limitations. The model is trained exclusively on data synthesized via bicubic downsampling, which may not fully capture the complex and varied degradation patterns (e.g., sensor-specific noise, motion blur, atmospheric effects) present in real-world infrared imaging; its generalization to such authentic conditions therefore requires further validation. Additionally, although the model is designed to be efficient, its computational and memory requirements could still be challenging for deployment on extremely resource-constrained edge devices with stringent power and latency budgets. Furthermore, this work focuses on scaling factors of ×2 and ×4; effectiveness on more extreme SR tasks (e.g., ×8 and above) remains unexplored and presents a direction for future research.

Acknowledgements

This research was funded by Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01C461, 2022D01C460), “Tianshan Talents” Famous Teachers in Education and Teaching project of Xinjiang Uygur Autonomous Region (2025), and the “Tianchi Talents Attraction Project” of Xinjiang Uygur Autonomous Region (2024TCLJ04).

Author contributions

Z.L. main contribution was to propose the methodology of this work and write the paper. J.T. guided the entire research process, participated in writing the paper and secured funding for the research. C.L. participated in the implementation of the algorithm. G.Z. was responsible for algorithm validation and data analysis. All authors have read and agreed to the published version of the manuscript.

Data availability

The datasets used in this study are publicly available. The IR700 dataset can be accessed from the corresponding article. Additional test datasets (results-A and results-C, CVC09-100, IR700-test, M3FD) are also publicly available via their respective cited sources. The implementation code for HPEN is now available at https://github.com/smilenorth1/HPEN-main. Materials availability: The datasets used or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally to this work: Zepeng Liu and Jiya Tian.

Contributor Information

Zepeng Liu, Email: smilenorth089@163.com.

Guodong Zhang, Email: 2023021@xjit.edu.cn.

References

  • 1.Zhan, Y., Hou, H., Sheng, G., Zhang, Y. & Jiang, X. Infrared image segmentation and temperature monitoring based on yolo model object detection results. In 2024 10th International Conference on Condition Monitoring and Diagnosis (CMD), 456–460, 10.23919/CMD62064.2024.10766200 (2024).
  • 2.Tuerniyazi, A., Lan, J., Zeng, Y., Hu, J. & Zhuo, Y. Multiview angle uav infrared image simulation with segmented model and object detection for traffic surveillance. Sci. Reports15, 1–18. 10.1038/s41598-025-89585-x (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Wang, S., Du, Y., Zhao, S. & Gan, L. Multi-scale infrared military target detection based on 3x-fpn feature fusion network. IEEE Access11, 141585–141597. 10.1109/ACCESS.2023.3343419 (2023). [Google Scholar]
  • 4.Cui, H., Xu, Y., Zeng, J. & Tang, Z. The methods in infrared thermal imaging diagnosis technology of power equipment. In 2013 IEEE 4th International Conference on Electronics Information and Emergency Communication, 246–251, 10.1109/ICEIEC.2013.6835498 (2013).
  • 5.Wang, Q., Jin, P., Wu, Y., Zhou, L. & Shen, T. Infrared image enhancement: A review. IEEE J. Sel. Top. Appl. Earth Obs.Remote. Sens.18, 3281–3299. 10.1109/JSTARS.2024.3523418 (2025). [Google Scholar]
  • 6.Florio, C. et al. The potential of near infrared (nir) spectroscopy coupled to principal component analysis (pca) for product and tanning process control of innovative leathers. Sci. Reports15, 1–12. 10.1038/s41598-025-17598-7 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Dong, C., Loy, C. C., He, K. & Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 184–199, 10.1007/978-3-319-10593-2_13 (2014).
  • 8.Dong, C., Loy, C. C. & Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 391–407, 10.1007/978-3-319-46475-6_25 (2016).
  • 9.Kim, J., Lee, J. K. & Lee, K. M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 1637–1645, 10.1109/CVPR.2016.181 (2016).
  • 10.Kim, J., Lee, J. K. & Lee, K. M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the conference on Computer Vision and Pattern Recognition, 1646–1654, 10.1109/CVPR.2016.182 (2016).
  • 11.Hui, Z., Gao, X., Yang, Y. & Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th acm international conference on Multimedia, 2024–2032, 10.1145/3343031.3351084 (2019).
  • 12.Ledig, C. et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the conference on Computer Vision and Pattern Recognition, 105–114, 10.1109/CVPR.2017.19 (2017).
  • 13.Li, Z. et al. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 833–843, 10.1109/CVPRW56347.2022.00099 (2022).
  • 14.Sun, L., Dong, J., Tang, J. & Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13190–13199, 10.1109/ICCV51070.2023.01213 (2023).
  • 15.Shi, W., Caballero, J., Huszár, F., Totz, J. & Aitken, A. P. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1874–1883, 10.1109/CVPR.2016.207 (2016).
  • 16.Zhang, Y. et al. Image super-resolution using very deep residual channel attention networks. In Proceedings of the IEEE/CVF on European Conference on Computer Vision, 286–301, 10.1007/978-3-030-01234-2_18 (2018).
  • 17.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the conference on International Conference on Learning Representations, 10.48550/arXiv.2010.11929 (2021).
  • 18.Chen, H. et al. Pre-trained image processing transformer. In Proceedings of the conference on Computer Vision and Pattern Recognition, 12299–12310, 10.1109/CVPR46437.2021.01212 (2021).
  • 19.Zhou, Y. et al. Srformer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12734–12745, 10.1109/ICCV51070.2023.01174 (2023).
  • 20.Liang, J. et al. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1833–1844, 10.1109/ICCVW54120.2021.00210 (2021).
  • 21.Zamir, S. W. et al. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 5728–5739, 10.48550/arXiv.2111.09881 (2022).
  • 22.Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 5998–6008, 10.48550/arXiv.1706.03762 (2017).
  • 23.Park, N. & Kim, S. How do vision transformers work? In Proceedings of the IEEE/CVF International Conference on Learning Representations, 10.48550/arXiv.2202.06709 (2022).
  • 24.Park, N. & Kim, S. How do vision transformers work? In Proceedings of the International Conference on Learning Representations, 10.48550/arXiv.2202.06709 (2022).
  • 25.Liu, X., Liu, J., Tang, J. & Wu, G. Catanet: Efficient content-aware token aggregation for lightweight image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 10.48550/arXiv.2503.06896 (2025).
  • 26.Zhang, X., Zhang, Y. & Yu, F. Hit-sr: Hierarchical transformer for efficient image super-resolution. In Proceedings of the conference on European Conference on Computer Vision, 483–500, 10.48550/arXiv.2205.04437 (2024).
  • 27.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the conference on Computer Vision and Pattern Recognition, 770–778, 10.1109/CVPR.2016.90 (2016).
  • 28.Tai, Y., Yang, J. & Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 3147–3155, 10.1109/CVPR.2017.298 (2017).
  • 29.Lim, B., Son, S., Kim, H., Nah, S. & Lee, K. M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the conference on Computer Vision and Pattern Recognition Workshops, 136–144, 10.1109/CVPRW.2017.151 (2017).
  • 30.Liu, J., Tang, J. & Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision workshops, 41–55, 10.48550/arXiv.2009.11551 (2020).
  • 31.Ahn, N., Kang, B. & Sohn, K.-A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 252–268, 10.48550/arXiv.1803.08664 (2018).
  • 32.Hui, Z., Wang, X. & Gao, X. Fast and accurate single image super-resolution via information distillation network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 723–731, 10.1109/CVPR.2018.00082 (2018).
  • 33.Sun, L., Pan, J. & Tang, J. Shufflemixer: An efficient convnet for image super-resolution. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 17314–17326, 10.48550/arXiv.2205.15175 (2022).
  • 34.Zheng, M., Sun, L., Dong, J. & Pan, J. Smfanet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 359–375, 10.1007/978-3-031-72973-7_21 (2025).
  • 35.Chen, X., Wang, X., Zhou, J., Qiao, Y. & Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 22367–22377, 10.48550/arXiv.2205.04437 (2023).
  • 36.Chudasama, V. et al. Therisurnet-a computationally efficient thermal image super-resolution network. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 388–397, 10.1109/CVPRW50498.2020.00051 (2020).
  • 37.Huang, Y., Jiang, Z., Lan, R., Zhang, S. & Pi, K. Infrared image super-resolution via transfer learning and PSRGAN. IEEE Signal Process. Lett. 28, 982–986, 10.1109/LSP.2021.3077801 (2021).
  • 38.Marivani, I., Tsiligianni, E., Cornelis, B. & Deligiannis, N. Multimodal deep unfolding for guided image super-resolution. IEEE Trans. Image Process. 29, 8443–8456, 10.1109/TIP.2020.3014729 (2020).
  • 39.Ying, X. et al. Local motion and contrast priors driven deep network for infrared small target super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 15, 5480–5495, 10.1109/JSTARS.2022.3183230 (2022).
  • 40.Liu, S. et al. Infrared image super-resolution via lightweight information split network. arXiv preprint, 10.48550/arXiv.2405.10561 (2024).
  • 41.Qin, F. et al. LKFormer: Large kernel transformer for infrared image super-resolution. Multimed. Tools Appl. 83, 72063–72077, 10.1007/s11042-024-18409-3 (2024).
  • 42.Liu, G., Zhou, S., Chen, X., Yue, W. & Ke, J. Recurrent large kernel attention network for efficient single infrared image super-resolution. IEEE Access 12, 923–935, 10.1109/ACCESS.2023.3344830 (2024).
  • 43.Lin, C.-A., Liu, T.-J. & Liu, K.-H. LIRSRN: A lightweight infrared image super-resolution network. In 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5, 10.1109/ISCAS58744.2024.10558676 (2024).
  • 44.Juhartini, Dwinita, A. & Desmiwati. Single shot multibox detector (SSD) in object detection: A review. Int. J. Adv. Comput. Informatics 1, 118–127, 10.71129/ijaci.v1i2.pp118-127 (2025).
  • 45.Ricki, S., Dicky, H. & Bahalwan, A. Fast region-based convolutional neural network in object detection: A review. Int. J. Adv. Comput. Informatics 2, 34–40, 10.71129/ijaci.v2i1 (2025).
  • 46.Vivi, A. & Surni, E. YOLOv8 for object detection: A comprehensive review of advances, techniques, and applications. Int. J. Adv. Comput. Informatics 2, 53–61, 10.71129/ijaci.v2i1 (2025).
  • 47.Wang, Y., Li, Y., Wang, G. & Liu, X. Multi-scale attention network for single image super-resolution. J. Vis. Commun. Image Represent. 80, 103300, 10.1016/j.jvcir.2021.103300 (2021).
  • 48.Wanda, I. & Abubakar, A. A comprehensive review of ConvNeXt architecture in image classification: Performance, applications, and prospects. Int. J. Adv. Comput. Informatics 2, 108–114, 10.71129/ijaci.v2i2 (2025).
  • 49.Surni, E., Vivi, A. & Bahtiar, I. Mask region-based convolutional neural network in object detection: A review. Int. J. Adv. Comput. Informatics 1, 106–117, 10.71129/ijaci.v1i2.pp106-117 (2025).
  • 50.Ch, R., Shaik, J., Srikavya, R., Sahu, M. & Sahu, A. K. A novel fiestal structured chromatic series-based data security approach. Discov. Internet Things 5, 106, 10.1007/s43926-025-00162-0 (2025).
  • 51.Sahu, A. et al. Dual image-based reversible fragile watermarking scheme for tamper detection and localization. Pattern Anal. Appl. 26, 571–590 (2023).
  • 52.Zou, Y. et al. Super-resolution reconstruction of infrared images based on a convolutional neural network with skip connections. Opt. Lasers Eng. 146, 106717, 10.1016/j.optlaseng.2021.106717 (2021).
  • 53.Liu, Y., Chen, X., Cheng, J., Peng, H. & Wang, Z. Infrared and visible image fusion with convolutional neural networks. Int. J. Wavelets Multiresolut. Inf. Process. 16, 1850018, 10.1142/S0219691318500182 (2018).
  • 54.Bai, Y. et al. IBFusion: An infrared and visible image fusion method based on infrared target mask and bimodal feature extraction strategy. IEEE Trans. Multimed. 26, 10610–10622, 10.1109/TMM.2024.3410113 (2024).
  • 55.Liu, J. et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5802–5811, 10.48550/arXiv.2203.16220 (2022).
  • 56.Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 10.48550/arXiv.1412.6980 (2015).
  • 57.Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 10.48550/arXiv.1608.03983 (2017).
  • 58.Wang, L. et al. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4915–4924, 10.1109/CVPR46437.2021.00488 (2021).
  • 59.Fang, J., Lin, H., Chen, X. & Zeng, K. A hybrid network of CNN and transformer for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1102–1111, 10.1109/CVPRW56347.2022.00119 (2022).
  • 60.Wu, G., Jiang, J., Jiang, K. & Liu, X. Fully 1×1 convolutional network for lightweight image super-resolution. Mach. Intell. Res. 21, 1–15, 10.1007/s11633-024-1501-9 (2025).
  • 61.Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 586–595, 10.48550/arXiv.1801.03924 (2018).
  • 62.Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 10.1145/3065386 (2012).
  • 63.Gu, J. & Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9199–9208, 10.48550/arXiv.2011.11036 (2021).

Associated Data
Data Availability Statement

The datasets used in this study are publicly available. The IR700 dataset can be accessed from the corresponding article. The additional test datasets (results-A and results-C, CVC09-100, IR700-test, M3FD) are also publicly available via their respective cited sources. The implementation code for HPEN is available at https://github.com/smilenorth1/HPEN-main. Materials availability: the datasets used or analysed during the current study are available from the corresponding author on reasonable request.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group