Scientific Reports. 2026 Jan 29;16:6572. doi: 10.1038/s41598-026-37763-w

A lightweight hybrid perception enhancement network for infrared image super-resolution

Zepeng Liu 1,✉,#, Jiya Tian 1,#, Chao Liu 1, Guodong Zhang 1
PMCID: PMC12909808  PMID: 41611970

Abstract

Infrared image super-resolution (SR) remains a challenging task due to inherent limitations in existing approaches: convolutional neural network (CNN)-based methods struggle with long-range dependency modeling, whereas transformer-based methods are computationally expensive and tend to overlook fine local details. To address these issues, we propose a novel hybrid perception enhancement network (HPEN). Its core component is a hybrid perception enhancement block (HPEB), which effectively combines a token aggregation block (TAB) for global context modeling, a multi-scale feature enhancement block (MFEB) for local detail extraction, and a convolutional layer for feature refinement. Extensive experimental results demonstrate that the proposed HPEN achieves leading performance among the compared methods. For the challenging ×4 SR task, it attains the best PSNR and SSIM values among the evaluated lightweight SR approaches while demonstrating remarkable efficiency advantages. Specifically, compared to HiT-SR, HPEN reduces FLOPs by 42.9%, uses only 9.4% of the GPU memory, and delivers faster inference. The code is available at https://github.com/smilenorth1/HPEN-main.

Keywords: Infrared image super-resolution, Hybrid perception, Token aggregation, Multi-scale

Subject terms: Computational science, Computer science, Scientific data

Introduction

Infrared imaging plays a vital role in several critical fields due to its all-weather capability and ability to capture thermal information. For instance, in surveillance and security systems1,2, it enables person and vehicle detection under low-visibility conditions; in defense and aerospace3, it supports target recognition and night-vision navigation; in medical diagnostics4,5, it assists in non-invasive screening such as inflammation detection and vascular imaging; and in industrial inspection6, it helps identify overheating components or structural defects in power equipment and machinery. However, the spatial resolution of infrared sensors is inherently limited, which often restricts their practical usefulness. Infrared image super-resolution (SR) offers a promising solution to enhance image quality beyond the constraints of physical hardware. Although existing methods have made considerable progress, they still face a fundamental challenge in balancing reconstruction accuracy, particularly the preservation of fine details, with the computational efficiency required for real-time deployment.

Convolutional Neural Networks (CNNs)7–14 have become the dominant paradigm in image SR, leveraging their inherent strengths in local feature extraction and hierarchical representation learning. However, CNN-based methods7,8,10,15 are constrained by three fundamental limitations: (1) their restricted receptive fields impede the modeling of long-range dependencies; (2) static convolution kernels lack adaptability to multi-scale patterns; and (3) conventional architectures often fail to support effective cross-resolution interactions. Although some efforts have attempted to address these issues by designing deeper and wider networks (e.g., RCAN16 with over 400 layers), such expansions come at the cost of prohibitive computational complexity, which severely limits their practical deployment in efficiency-sensitive real-world scenarios.

Recent advances in vision transformer (ViT) architectures17–21 have demonstrated remarkable capabilities in image SR, primarily attributed to the global receptive field and powerful context modeling enabled by the self-attention (SA) mechanism. However, two critical limitations remain: (1) the computational complexity of self-attention leads to high memory and processing requirements17,18,22; and (2) its focus on long-range dependencies often results in inadequate representation of high-frequency local details, such as edges and textures23,24. Although subsequent studies have attempted to alleviate these issues through various strategies (e.g., clustering-based local self-attention25 and hierarchical windows26), challenges remain in balancing lightweight design with detail preservation for infrared image SR.

To address these issues, we propose a novel hybrid perception enhancement block (HPEB) that systematically combines three complementary components: a token aggregation block (TAB) for comprehensive global feature modeling, a multi-scale feature enhancement block (MFEB) for extracting local-contextual details, and a $1\times1$ convolution for cross-channel feature refinement and spatial information optimization. Specifically, the TAB module follows a three-stage process: it first groups similar image patches using learnable token centers updated via an exponential moving average; then employs a dual-branch architecture integrating intra-group self-attention (IASA) and inter-group cross-attention (IRCA) to fuse local refinements with global semantics; and finally applies a convolutional feed-forward network (ConvFFN) to further enhance local feature representation. The MFEB adopts a parallel strategy: it uses dual-branch dilated convolutions (with dilation rates of 1 and 2) to capture both fine details and broader context, dynamically fuses these multi-scale features via a $1\times1$ convolution, and subsequently strengthens global dependencies through Restormer layers (RTLs). By integrating these carefully designed modules, we construct an end-to-end trainable network named HPEN. Extensive experimental results demonstrate that the proposed HPEN achieves an optimal balance between computational efficiency and reconstruction quality, as quantitatively shown in Fig. 1.

Fig. 1.

Comprehensive comparison between the proposed method and other lightweight approaches for ×4 SR on the CVC-09 dataset. The left subplot illustrates the reconstruction quality (PSNR) against both parameter count and computational complexity (FLOPs), with circle sizes representing FLOPs magnitude. The right subplot demonstrates PSNR in relation to inference time and GPU memory consumption, where circle sizes correspond to GPU memory usage. Experimental results indicate that the proposed HPEN achieves a favorable balance between computational overhead (parameters, FLOPs, runtime, and GPU memory) and reconstruction performance (PSNR) compared to existing lightweight SR models.

The main contributions of this work can be summarized as follows:

  • We propose a novel hybrid perception enhancement block (HPEB) that effectively integrates global structural modeling with local detail preservation in a unified architecture, enabling simultaneous capture of long-range dependencies and fine-grained textures for high-quality infrared image super-resolution.

  • We develop a multi-scale feature enhancement block (MFEB) that applies dual-branch dilated convolutions to extract local-contextual information and integrates Restormer layers (RTLs) to enhance global dependencies, thereby establishing a collaborative local-global optimization framework.

  • We conduct comprehensive evaluation of the proposed HPEN model across six datasets, demonstrating its ability to achieve an optimal trade-off between model complexity and reconstruction performance.

Related works

CNN-based SR methods

CNNs have established themselves as a cornerstone of modern computer vision, owing to their exceptional ability to learn hierarchical feature representations. This dominance naturally extends to the field of image SR. The pioneering SRCNN7 first introduces an end-to-end CNN framework for single image SR. Subsequently, VDSR10 advances training efficiency by incorporating residual learning27 and optimized convergence strategies. Later innovations, such as DRCN9 and DRRN28, further improve reconstruction quality through recursive structures and enhanced residual connections27. SRGAN12 revolutionizes perceptual quality by employing adversarial training to overcome the limitations of conventional loss functions. However, constrained by the inherently local receptive field of convolution operations, these methods struggle to capture long-range dependencies and global contextual information, leading to limited representational capacity. To address this issue, researchers have progressively increased network scale and complexity. For instance, EDSR29 significantly boosts performance by expanding to 32 layers and 43 million parameters, while RCAN16 constructs an architecture exceeding 400 layers through channel attention and residual connections. Nevertheless, this pursuit of performance through greater depth and width comes at a significant cost: the high computational complexity and memory footprint of these large models severely hinder their deployment in real-world application scenarios.

To address computational complexity, numerous lightweight CNN designs8,11,13,14,30 have been developed to balance performance and efficiency through various strategies. For instance, FSRCNN8 and ESPCN15 improve efficiency by employing post-upsampling operations. CARN31 achieves fast and accurate SR using group convolution and cascaded feature integration. The concept of information distillation, introduced by IDN32 through enhancement and compression units, is further advanced by IMDN11 with multi-distillation modules that optimize the memory-time trade-off. RFDN30 further refines this concept with enhanced feature distillation links, winning the NTIRE 2022 Efficient SR Challenge. BSRN13 reduces computational redundancy via blueprint separable convolution, while ShuffleMixer33 minimizes FLOPs by employing large kernels combined with channel operations. More recent approaches incorporate dynamic feature modulation for greater flexibility. For example, SAFMN14 implements spatially adaptive modulation within a ViT-like architecture for long-range dependency modeling, and SMFANet34 leverages dual-branch processing to handle both local and non-local features. Despite these advancements, such methods remain fundamentally constrained by the use of fixed convolutional kernels, leading to edge blurring and detail loss. To overcome these limitations, we propose a hybrid perception enhancement block (HPEB), which facilitates effective global modeling and hierarchical feature interactions, thereby significantly improving reconstruction quality while maintaining high computational efficiency.

ViT-based SR methods

ViTs17 have recently shown strong potential for image SR, primarily due to their ability to model long-range dependencies and capture global context information. The pioneering work IPT18 first adapts the transformer architecture for SR tasks, establishing new performance benchmarks. However, the standard ViT design incurs very high computational costs, mainly because its global SA mechanism22 scales quadratically with input size. To mitigate this, several efficient variants have been developed. SwinIR20 employs shifted window attention to limit computation to local windows while maintaining cross-window connections, greatly reducing complexity. HAT35 further improves this approach by integrating channel attention with overlapping cross-attention, achieving state-of-the-art results. Restormer21 proposes a complementary strategy using multi-head transposed attention to efficiently capture global context. CATANet25 improves parallel processing through subgroup token balancing and incorporates residual local relation self-attention to harmonize global and local representations under linear complexity, while HiT-SR26 employs hierarchical window expansion to capture multi-scale contexts and long-range dependencies for efficient SR reconstruction.

Despite these advancements, current ViT-based SR methods still struggle to optimally balance performance and efficiency. Most either sacrifice reconstruction quality to maintain reasonable computational costs, or achieve superior performance at the expense of prohibitive complexity. This fundamental trade-off highlights the need for a more balanced approach that can maintain high reconstruction quality while ensuring practical deployability.

Infrared image SR methods

Deep learning has significantly advanced infrared image processing by demonstrating powerful capabilities in automated feature extraction and complex nonlinear mapping. Infrared image SR, in particular, has emerged as a promising application. Numerous deep learning methods have been developed specifically for this domain: TherISuRNet36 utilizes progressive upscaling with asymmetric residual learning, while PSRGAN37 adopts a dual-path architecture that leverages visible light imagery to compensate for limited IR training data. Marivani et al.38 incorporate sparse priors within a multi-modal framework, and MoCoPnet39 integrates domain knowledge to address the inherent feature scarcity in IR small target SR. For efficient deployment, LISN40 achieves lightweight reconstruction through feature correlation aggregation and channel operations, while LKFormer41 replaces self-attention with large-kernel depth-wise convolutions for non-local modeling and employs gated-pixel feed-forward networks to improve information flow. Additionally, RLKA-Net42 employs recurrent strategies and large kernel attention to achieve efficient infrared SR, and LIRSRN43 incorporates an attention enhancement module with spatial-frequency processing. Despite these advancements, existing methods still face limitations in optimally balancing reconstruction performance with computational efficiency. Most approaches either prioritize accuracy at the cost of practical deployability, or achieve efficiency while compromising the restoration of fine details that are critical for infrared applications, highlighting the need for more balanced solutions in the infrared SR domain.

In recent years, general-purpose computer vision architectures have achieved remarkable progress in tasks such as detection and classification. Object detection frameworks such as SSD44, Faster R-CNN45, and YOLOv846 have achieved strong performance in real-time processing, small object detection, and complex scene understanding by integrating lightweight designs11,30, multi-scale feature fusion14,34, and attention mechanisms13,42,47. Similarly, modern convolutional architectures like ConvNeXt48 have attained near-state-of-the-art accuracy in classification tasks, particularly in medical imaging through deep integration with attention modules. Mask R-CNN49 has also achieved significant results in instance segmentation. Beyond perceptual tasks, related research in image security has introduced methods such as a structured chromatic encryption algorithm50 that combines color mapping with mathematical transformations to protect sensitive medical multimedia data, and a reversible fragile watermarking scheme51 that embeds two bits per pixel to generate dual watermarked images, achieving high capacity and robustness while preserving visual transparency. These advances collectively reflect a common design principle: balancing performance and complexity through architectural refinement and efficient attention fusion. However, these studies primarily target high-level vision tasks such as detection and classification in the visible spectrum, and their design paradigms are not directly applicable to infrared image super-resolution, a representative low-level vision task. Infrared imagery exhibits unique characteristics, including thermal radiation properties, low contrast, and complex noise patterns, which demand specialized models capable of enhanced detail recovery and targeted degradation modeling.
Inspired by the “hybrid architecture” and “lightweight attention” principles in these advanced works, this paper focuses on addressing the specific challenges of infrared image SR. We propose the HPEN network, which incorporates a core hybrid perception enhancement block. By combining customized token aggregation with multi-scale convolutional enhancement, our approach achieves synergistic improvement of both global structures and local details in infrared images, establishing an improved balance between reconstruction accuracy and computational efficiency.

Proposed method

We propose a simple yet effective SR model that delivers accurate infrared image reconstruction by collaboratively exploiting local and non-local features. The core innovation is the hybrid perception enhancement block (HPEB), a unified architecture that systematically combines complementary feature processing mechanisms. The HPEB operates in three sequential stages: (1) a token aggregation block (TAB) to capture global contextual dependencies; (2) a multi-scale feature enhancement block (MFEB) to extract fine-grained local details; and (3) a $1\times1$ convolutional layer for feature refinement. This synergistic design allows the model to focus on information-rich regions while maintaining computational efficiency, thereby establishing a new balance between performance and complexity in infrared SR. The details of each component are elaborated below.

Figure 2 illustrates the overall architecture of the proposed HPEN network. It consists of three main components: shallow feature extraction, deep feature extraction, and an upsampler module. Specifically, a $3\times3$ convolutional layer is first applied to the low-resolution input $I_{LR}$ to extract shallow features, denoted as $F_0$. This process can be formulated as:

$F_0 = H_{SF}(I_{LR}) \qquad (1)$

where $H_{SF}(\cdot)$ denotes the shallow feature extraction operation. These shallow features are then fed into a stack of hybrid perception enhancement blocks (HPEBs). Each HPEB contains a token aggregation block (TAB), a multi-scale feature enhancement block (MFEB), and a $1\times1$ convolutional layer. This procedure can be expressed as:

$F_D = H_{DF}(F_0) \qquad (2)$

where $H_{DF}(\cdot)$ represents the deep feature extraction process carried out by the stacked HPEBs, and $F_D$ refers to the resulting deep feature representation. To facilitate gradient flow and preserve low-frequency information, we incorporate a global residual connection, which allows the network to concentrate on reconstructing high-frequency details. The upsampler module, which consists of a $3\times3$ convolution followed by a sub-pixel convolution layer15, then efficiently reconstructs the high-resolution output. This process can be represented as:

$I_{SR} = H_{UP}(F_D + F_0) \qquad (3)$

where $H_{UP}(\cdot)$ is the upsampler module that reconstructs the high-resolution image, yielding the final output $I_{SR}$. For the loss function, we adopt the $L_1$ loss, following previous work34,42 in SR tasks, which is defined as:

$\mathcal{L}_{1} = \left\| I_{SR} - I_{HR} \right\|_1 \qquad (4)$

where $I_{HR}$ represents the ground-truth high-resolution image, and $\|\cdot\|_1$ denotes the $\ell_1$-norm operator.
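The pipeline of Eqs. (1)–(3) can be sketched in PyTorch as follows. This is a simplified illustration rather than the released implementation: the HPEB internals are replaced by a plain convolutional stub, the input is treated as single-channel, and the upsampler kernel size is assumed to be 3×3.

```python
# Sketch of the HPEN data flow: shallow conv -> stacked blocks with a
# global residual -> sub-pixel upsampler. HPEB internals are stubbed out.
import torch
import torch.nn as nn

class HPEBStub(nn.Module):
    """Placeholder for a hybrid perception enhancement block."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 1))              # final 1x1 refinement
    def forward(self, x):
        return self.body(x) + x

class HPENSketch(nn.Module):
    def __init__(self, dim=36, n_blocks=8, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(1, dim, 3, padding=1)                    # Eq. (1)
        self.deep = nn.Sequential(*[HPEBStub(dim) for _ in range(n_blocks)])  # Eq. (2)
        self.upsampler = nn.Sequential(                                   # Eq. (3)
            nn.Conv2d(dim, scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
    def forward(self, lr):
        f0 = self.shallow(lr)
        fd = self.deep(f0) + f0      # global residual connection
        return self.upsampler(fd)

sr = HPENSketch()(torch.randn(1, 1, 32, 32))
print(sr.shape)  # torch.Size([1, 1, 128, 128])
```

The dim=36 / n_blocks=8 defaults mirror the baseline configuration reported in the experiments.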

Fig. 2.

Overall architecture of the proposed HPEN. The network comprises three fundamental components: an initial $3\times3$ convolutional layer for shallow feature extraction, a series of hybrid perception enhancement blocks (HPEBs) for deep feature extraction, and a final upsampling module for high-resolution reconstruction. Each HPEB represents the core building block and integrates three carefully designed sub-modules: a token aggregation block (TAB) for global dependency modeling, a multi-scale feature enhancement block (MFEB) for local detail extraction, and a $1\times1$ convolutional layer for feature refinement and integration.

Token aggregation block

In image SR tasks, conventional CNN-based methods14,29,30 are limited by the fixed receptive fields of their convolutional kernels, while transformer-based approaches18,20,35 often incur excessive computational costs and lack inherent spatial inductive bias despite their global modeling capacity. To address these issues, we introduce the token aggregation block (TAB) adapted from CATANet25, which efficiently captures feature interactions through content-aware dynamic reorganization and a hierarchical attention mechanism. As illustrated in Fig. 2(b), TAB consists of three core components: the content-aware token aggregation (CATA) module, intra-group self-attention (IASA), and inter-group cross-attention (IRCA). Given an input feature $X$, the CATA module dynamically clusters image patches using a set of learnable token centers. These centers are updated during training via an exponential moving average (EMA) strategy with a decay parameter of 0.999 to enable semantically coherent grouping. The CATA operation can be expressed as:

$Z = \mathrm{CATA}(\mathrm{LN}(X)) \qquad (5)$

where $\mathrm{LN}(\cdot)$ denotes the layer normalization operation and $Z$ represents the clustered features. The IASA module then performs fine-grained feature interactions within each token subgroup, using overlapped grouping to preserve local spatial continuity. This operation can be written as:

$Z_{\mathrm{IASA}} = \mathrm{IASA}(Z) \qquad (6)$

where $Z_{\mathrm{IASA}}$ corresponds to the output feature of the IASA module. Subsequently, the IRCA module facilitates global semantic propagation by enabling interaction across token groups and a global center token. The IRCA module is formulated as follows:

$Z_{\mathrm{IRCA}} = \mathrm{IRCA}(Z_{\mathrm{IASA}}) \qquad (7)$

where $Z_{\mathrm{IRCA}}$ denotes the output feature of the IRCA module. Finally, a convolutional feed-forward network (ConvFFN) is applied to further integrate local features and channel information, thereby enhancing the model’s nonlinear representation capacity. This step is expressed as:

$F_{\mathrm{TAB}} = \mathrm{ConvFFN}\big(\tilde{Z}\big), \quad \tilde{Z} = \mathrm{Conv}_{1\times1}(Z_{\mathrm{IRCA}}) \qquad (8)$

where $\mathrm{Conv}_{1\times1}(\cdot)$ denotes the $1\times1$ convolution, $\tilde{Z}$ represents the intermediate aggregated features, $\mathrm{ConvFFN}(\cdot)$ corresponds to the convolution-enhanced feed-forward network, and $F_{\mathrm{TAB}}$ is the final output of the TAB module.
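The EMA-based center update at the heart of CATA can be illustrated with a small, self-contained sketch (NumPy, with hypothetical function names): each token is assigned to its nearest center, and each center is then blended toward the mean of its assigned tokens. A smaller decay than the paper's 0.999 is used here so the update is visible in a single step.

```python
# Illustrative sketch of the CATA grouping idea: nearest-center
# assignment followed by an exponential-moving-average center update.
import numpy as np

def ema_cluster_step(tokens, centers, decay=0.9):
    # tokens: (N, C) flattened patch features; centers: (K, C)
    d = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) sq. distances
    assign = d.argmin(axis=1)                                      # nearest center index
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = tokens[assign == k]
        if len(members):                                           # EMA update of center k
            new_centers[k] = decay * centers[k] + (1 - decay) * members.mean(0)
    return assign, new_centers

tokens = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.0, 0.1], [4.0, 4.0]])
assign, centers = ema_cluster_step(tokens, centers)
print(assign)  # [0 0 1 1]
```

Because only a fraction (1 − decay) of each step's statistics enters the centers, the grouping drifts slowly and stays semantically stable across training iterations.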

Multi-scale feature enhancement block

While the TAB focuses on modeling global dependencies through dynamic token aggregation, high-quality image reconstruction also critically depends on fine local details. To this end, we propose a multi-scale feature enhancement block (MFEB) that explicitly enhances local feature extraction while integrating global contextual information, forming a collaborative local-global optimization framework. As shown in Fig. 2(c), given an input feature $X$, the MFEB first splits it into two parallel branches, formulated as:

$[X_1, X_2] = \mathrm{Split}(X) \qquad (9)$

where $\mathrm{Split}(\cdot)$ denotes the channel splitting operation. One branch then applies a $3\times3$ depth-wise convolution to capture fine-grained textures, while the other utilizes a $3\times3$ dilated convolution (dilation rate = 2) to model broader contextual information. This process can be formulated as:

$Y_1 = \mathrm{GELU}\big(\mathrm{DWConv}_{3\times3}(X_1)\big), \quad Y_2 = \mathrm{GELU}\big(\mathrm{DConv}_{3\times3}^{r=2}(X_2)\big) \qquad (10)$

where $\mathrm{DConv}_{3\times3}^{r=2}(\cdot)$ represents a $3\times3$ dilated convolution with a dilation rate of 2, $\mathrm{DWConv}_{3\times3}(\cdot)$ indicates a depth-wise convolution of kernel size $3\times3$, and $\mathrm{GELU}(\cdot)$ refers to the GELU activation function. The outputs of the two branches are then dynamically fused via a $1\times1$ convolution. The fusion process can be expressed as:

$Y = \mathrm{Conv}_{1\times1}\big(\mathrm{Concat}(Y_1, Y_2)\big) + X \qquad (11)$

where $\mathrm{Concat}(\cdot)$ denotes the channel concatenation operation. To facilitate gradient flow across network levels, we incorporate a residual connection. Finally, two successive Restormer layers (RTLs)21 are applied to further strengthen global feature dependencies. This process can be written as:

$F_{\mathrm{MFEB}} = \mathrm{RTL}\big(\mathrm{RTL}(Y)\big) \qquad (12)$

where $\mathrm{RTL}(\cdot)$ represents the Restormer layer transformation.
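Under the kernel-size assumptions above (3×3 depth-wise branches), the split-and-fuse path of Eqs. (9)–(11) can be sketched in PyTorch as follows; the two Restormer layers of Eq. (12) are omitted here and sketched in the next subsection.

```python
# Sketch of the MFEB dual-branch split/fuse path (Eqs. 9-11).
import torch
import torch.nn as nn

class MFEBSketch(nn.Module):
    def __init__(self, dim=36):
        super().__init__()
        half = dim // 2
        # branch 1: depth-wise 3x3 conv for fine textures (dilation 1)
        self.local = nn.Conv2d(half, half, 3, padding=1, groups=half)
        # branch 2: depth-wise 3x3 dilated conv (rate 2) for broader context
        self.context = nn.Conv2d(half, half, 3, padding=2, dilation=2, groups=half)
        self.fuse = nn.Conv2d(dim, dim, 1)   # dynamic 1x1 fusion
        self.act = nn.GELU()
    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)            # Eq. (9): channel split
        y1 = self.act(self.local(x1))                # Eq. (10): fine branch
        y2 = self.act(self.context(x2))              # Eq. (10): context branch
        y = self.fuse(torch.cat([y1, y2], dim=1))    # Eq. (11): concat + 1x1 fuse
        return y + x                                 # residual connection

out = MFEBSketch()(torch.randn(1, 36, 24, 24))
print(out.shape)  # torch.Size([1, 36, 24, 24])
```

With padding matched to the dilation rate, both branches preserve spatial resolution, so the channel-wise concatenation and residual addition line up without cropping.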

Restormer layer

To enhance long-range dependency modeling and feature representation, we incorporate the RTL21. This module combines multi-dconv head transposed attention (MDTA) and a gated-dconv feed-forward network (GDFN) to improve global context modeling and nonlinear transformation of fused features. As illustrated in Fig. 2(d), given an input $X$, the MDTA module aggregates both local and non-local pixel interactions, formulated as:

$X_{\mathrm{MDTA}} = \mathrm{MDTA}\big(\mathrm{LN}(X)\big) + X \qquad (13)$

where $X_{\mathrm{MDTA}}$ denotes the output of the MDTA module. The GDFN then dynamically modulates feature transformation by suppressing less informative components and selectively propagating relevant information through the network. The operation can be formulated as:

$X_{\mathrm{GDFN}} = \mathrm{GDFN}\big(\mathrm{LN}(X_{\mathrm{MDTA}})\big) + X_{\mathrm{MDTA}} \qquad (14)$

where $X_{\mathrm{GDFN}}$ represents the output feature of the GDFN module, and $\mathrm{LN}(\cdot)$ is defined as before.
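The efficiency of the RTL comes from computing attention across channels rather than spatial positions, yielding a C×C attention map instead of a quadratic (HW)×(HW) one. A minimal single-head sketch of this transposed attention, omitting MDTA's depth-wise convolutions and multi-head splitting, might look like:

```python
# Sketch of channel-wise ("transposed") attention: the attention map is
# (C x C), so cost grows linearly with the number of pixels HW.
import torch
import torch.nn.functional as F

def transposed_attention(x, temperature=1.0):
    b, c, h, w = x.shape
    q = k = v = x.reshape(b, c, h * w)              # (B, C, HW)
    q = F.normalize(q, dim=-1)                      # normalize along pixels
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * temperature  # (B, C, C) channel map
    attn = attn.softmax(dim=-1)
    out = attn @ v                                  # (B, C, HW)
    return out.reshape(b, c, h, w)

y = transposed_attention(torch.randn(2, 36, 16, 16))
print(y.shape)  # torch.Size([2, 36, 16, 16])
```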

Hybrid perception enhancement block

We construct the HPEB by integrating the TAB, MFEB, and a $1\times1$ convolutional layer in sequence. This three-stage cascaded design enables progressive feature refinement. Specifically, the TAB first performs content-aware global feature reorganization via dynamic token aggregation. Given an input $X$, this step is expressed as:

$F_{\mathrm{TAB}} = \mathrm{TAB}(X) \qquad (15)$

where $F_{\mathrm{TAB}}$ represents the output of the TAB module. The MFEB then enhances local-contextual representations using multi-scale convolutions (dilation rates of 1 and 2), formulated as:

$F_{\mathrm{MFEB}} = \mathrm{MFEB}(F_{\mathrm{TAB}}) \qquad (16)$

where $F_{\mathrm{MFEB}}$ represents the output of the MFEB module. Finally, a $1\times1$ convolution preserves multi-granularity details, preventing high-frequency information from being over-smoothed by the preceding transformer operations, expressed as:

$F_{\mathrm{HPEB}} = \mathrm{Conv}_{1\times1}(F_{\mathrm{MFEB}}) \qquad (17)$

where $\mathrm{Conv}_{1\times1}(\cdot)$ denotes a $1\times1$ convolution operation, and $F_{\mathrm{HPEB}}$ represents the final output of the HPEB.

Experimental results

Datasets and implementation details

Datasets. We use the IR700 dataset52 as the source of high-resolution (HR) images. This dataset comprises 700 infrared images encompassing diverse scenarios, including urban landscapes, vegetation, pedestrians, vehicles, and low-visibility conditions at night. Its authentic noise distribution and atmospheric degradation characteristics help mitigate the domain shift problem common in synthetically generated data. The dataset is partitioned into training, validation, and test sets in an 8:1:1 ratio, resulting in 560, 70, and 70 images, respectively. Low-resolution (LR) images are synthesized by applying bicubic downsampling to the HR images. For evaluation, we utilize six test datasets: results-A53, results-C54, CVC09-10053, IR700-test52, M3FD55, and my_test. We quantitatively evaluate the reconstruction quality using two standard metrics: peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), both computed on the luminance (Y) channel.
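For reference, PSNR on the luminance channel can be computed as in the following sketch (NumPy); the BT.601 luma weights used here are the common convention in SR benchmarks, and SSIM is omitted for brevity.

```python
# PSNR on the Y (luminance) channel of images in [0, 1].
import numpy as np

def rgb_to_y(img):
    # img: float array in [0, 1], shape (H, W, 3); BT.601 luma weights
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def psnr_y(sr, hr, eps=1e-12):
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10 * np.log10(1.0 / (mse + eps))   # peak value is 1.0

hr = np.ones((8, 8, 3)) * 0.5
sr = hr + 0.01                # uniform 0.01 error on every channel
print(round(psnr_y(sr, hr), 2))  # 40.0
```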

Implementation details. The model is trained with the following configuration. LR input patches are randomly cropped and augmented via random horizontal flipping and rotation. We train the network for 100,000 iterations with a batch size of 16, using the Adam optimizer56 and a MultiStepLR57 learning rate scheduler with a decaying learning rate. The network employs 8 HPEBs with 36 feature channels. All experiments are conducted on a Linux server using PyTorch 1.13.0 with CUDA 11.8, an NVIDIA GeForce RTX 3090 GPU (24GB VRAM), an Intel Xeon E5-2620 v4 CPU, and 32GB of system memory.
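A minimal training-setup sketch under this configuration is given below (PyTorch). The learning-rate value, Adam betas, milestone schedule, and 64×64 patch size are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the training configuration: Adam + MultiStepLR + L1 loss.
# Hyperparameter values below are illustrative assumptions.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(1, 1, 3, padding=1)          # stand-in for HPEN
optimizer = Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
scheduler = MultiStepLR(optimizer, milestones=[50_000, 80_000], gamma=0.5)
criterion = torch.nn.L1Loss()                        # the L1 loss of Eq. (4)

for it in range(3):                                  # abbreviated loop (100k iters in practice)
    lr_img = torch.randn(16, 1, 64, 64)              # batch size 16, assumed 64x64 patches
    hr_img = torch.randn(16, 1, 64, 64)
    loss = criterion(model(lr_img), hr_img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # step once per iteration

print(optimizer.param_groups[0]['lr'])  # 0.0005 (no milestone reached yet)
```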

Comparisons with state-of-the-art methods

Quantitative comparison. The proposed method is compared with state-of-the-art lightweight SR approaches, including CARN31, IMDN11, RFDN30, BSRN13, SMSR58, ShuffleMixer33, HNCT59, PSRGAN37, SAFMN14, RLKA-Net42, SMFANet34, SCNet60, LIRSRN43, CATANet25, and HiT-SR26. Quantitative results for the ×2 and ×4 magnification factors are summarized in Table 2. In addition to the standard PSNR and SSIM metrics, we also report the number of parameters (#Params) and computational complexity (#FLOPs). The FLOPs are uniformly measured using the fvcore library (i.e., fvcore.nn.flop_count_str) at a fixed high-resolution output size.

Table 2.

A comprehensive comparison with lightweight SR methods on six test datasets is presented. All reported PSNR and SSIM values are calculated on the luminance (Y) channel in the transformed color space. Computational complexity (#FLOPs) is evaluated using high-resolution images at a fixed output size. The top two results are highlighted in italic and bold italic font, respectively, for clear distinction.

Scale Model Params(K) FLOPs(G) resultsA resultsC CVC-09 IR700-test M3FD my_test
×2 CARN31 1592 297 39.75/0.9489 40.75/0.9589 44.83/0.9737 38.14/0.9435 40.20/0.9676 37.08/0.9157
IMDN11 715 218 39.77/0.9491 40.76/0.9589 44.87/0.9740 38.19/0.9438 40.28/0.9677 37.09/0.9159
RFDN30 417 122 39.76/0.9491 40.77/0.9590 44.88/0.9739 38.18/0.9437 40.13/0.9670 37.11/0.9161
BSRN13 332 96 39.77/0.9495 40.78/0.9593 44.85/0.9739 38.22/0.9440 40.24/0.9673 37.10/0.9160
SMSR58 987 468 39.76/0.9491 40.76/0.9590 44.84/0.9737 38.14/0.9435 40.20/0.9669 37.08/0.9158
ShuffleMixer_tiny33 247 76 39.74/0.9490 40.76/0.9589 44.85/0.9738 38.17/0.9437 40.16/0.9673 37.08/0.9158
HNCT59 357 111 39.78/0.9492 40.78/0.9591 44.88/0.9739 38.21/0.9440 40.28/0.9679 37.11/0.9159
PSRGAN37 313 - 39.76/0.9490 40.71/0.9589 44.79/0.9731 38.16/0.9434 38.24/0.9524 37.09/0.9158
SAFMN14 228 69 39.75/0.9491 40.64/0.9589 44.86/0.9739 38.13/0.9435 40.29/0.9679 37.08/0.9158
RLKA-Net42 225 253 39.79/0.9493 40.78/0.9590 44.83/0.9737 38.19/0.9438 40.05/0.9657 37.09/0.9158
SMFANet34 186 52 39.77/0.9493 40.78/0.9592 44.85/0.9738 38.11/0.9434 40.28/0.9677 37.06/0.9157
SCNet60 146 56 39.74/0.9489 40.74/0.9588 44.82/0.9737 38.08/0.9432 40.29/0.9679 37.06/0.9157
LIRSRN43 49 14 39.40/0.9475 40.11/0.9573 43.98/0.9730 37.03/0.9364 39.46/0.9658 36.27/0.9087
CATANet25 477 197 39.80/0.9493 40.79/0.9590 44.89/0.9738 38.26/0.9441 40.34/0.9684 37.09/0.9159
HiT-SR26 847 299 39.81/0.9496 40.81/0.9592 44.90/0.9741 38.27/0.9440 40.36/0.9683 37.12/0.9160
HPEN(Ours) 434 173 39.83/0.9497 40.82/0.9594 44.92/0.9740 38.28/0.9442 40.37/0.9684 37.13/0.9162
×4 CARN31 1592 121 34.77/0.8615 35.43/0.8783 40.54/0.9513 31.87/0.8447 32.93/0.8666 34.22/0.8752
IMDN11 715 55 34.83/0.8626 35.51/0.8791 40.73/0.9521 31.93/0.8450 33.07/0.8679 34.27/0.8757
RFDN30 433 32 34.86/0.8630 35.50/0.8793 40.75/0.9523 32.01/0.8462 33.09/0.8679 34.26/0.8757
BSRN13 352 26 34.90/0.8636 35.56/0.8799 40.77/0.9523 32.07/0.8472 33.14/0.8702 34.29/0.8761
SMSR58 1008 119 34.88/0.8631 35.52/0.8792 40.74/0.9521 32.08/0.8465 33.09/0.8700 34.26/0.8754
ShuffleMixer_tiny33 251 21 34.79/0.8621 35.49/0.8788 40.60/0.9516 31.92/0.8447 32.99/0.8673 34.23/0.8753
HNCT59 373 29 34.92/0.8638 35.58/0.8802 40.84/0.9527 32.12/0.8480 33.15/0.8699 34.30/0.8761
PSRGAN37 350 - 34.81/0.8616 35.47/0.8790 40.67/0.9516 32.04/0.8459 31.70/0.8379 34.25/0.8754
SAFMN14 240 18 34.86/0.8628 35.51/0.8791 40.72/0.9521 32.08/0.8464 33.10/0.8690 34.28/0.8758
RLKA-Net42 245 65 34.89/0.8637 35.51/0.8799 40.79/0.9524 32.11/0.8474 33.14/0.8691 34.28/0.8758
SMFANet34 197 14 34.86/0.8630 35.51/0.8791 40.75/0.9523 32.10/0.8469 33.16/0.8699 34.29/0.8760
SCNet60 154 28 34.77/0.8609 35.44/0.8778 40.54/0.9512 32.04/0.8455 33.12/0.8692 34.25/0.8753
LIRSRN43 70 5 34.20/0.8574 34.73/0.8734 39.01/0.9454 31.18/0.8267 32.41/0.8592 33.35/0.8706
CATANet25 535 64 34.91/0.8637 35.55/0.8800 40.83/0.9525 32.13/0.8483 33.20/0.8711 34.33/0.8765
HiT-SR26 866 77 34.91/0.8638 35.56/0.8801 40.86/0.9527 32.14/0.8485 33.22/0.8711 34.30/0.8763
HPEN(Ours) 445 44 34.93/0.8640 35.58/0.8803 40.88/0.9528 32.16/0.8487 33.25/0.8713 34.34/0.8768

We first conduct a systematic evaluation of the model’s channel dimension (“Dim”) and the number of HPEB modules (“Blocks”) to assess how different configurations affect performance, resource consumption, and efficiency. As shown in Table 1, the configuration with Dim=36 and Blocks=8 is selected as the baseline model, as it offers a favorable trade-off between efficiency and accuracy. Compared to a model of the same depth but wider channels (Dim=48, Blocks=8), this baseline reduces parameters and FLOPs by approximately 38.0% and 38.6%, respectively, while decreasing GPU memory usage by about 13.8%. When compared to the largest high-performance configuration (Dim=48, Blocks=12), the baseline achieves substantial savings of roughly 58.3% in parameters and 58.8% in FLOPs, while incurring an average performance drop of less than 0.5% across the three datasets. This configuration therefore balances resource efficiency and reconstruction accuracy effectively, justifying its selection as the baseline model.

Table 1.

Performance comparison of the proposed model across various channel configurations and module counts, with the baseline HPEN configuration highlighted in bold.

Ablation Dim Blocks #Params [K] #FLOPs [G] #GPU Mem. [M] #Avg.Time [ms] resultsA resultsC M3FD
HPEN 36 8 445 44.1 115.98 44.63 34.93/0.8640 35.58/0.8803 33.25/0.8713
36 10 553 54.8 117.79 47.93 34.98/0.8641 35.60/0.8804 33.27/0.8714
36 12 660 65.5 120.21 50.04 35.02/0.8644 35.63/0.8806 33.29/0.8713
48 8 718 71.8 134.62 48.96 35.00/0.8645 35.65/0.8807 33.31/0.8716
48 10 892 89.4 138.04 54.32 35.08/0.8649 35.67/0.8810 33.35/0.8718
48 12 1066 107.0 141.73 59.06 35.09/0.8648 35.70/0.8811 33.34/0.8717
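The efficiency ratios quoted for Table 1 can be re-derived directly from its rows. The following is a quick standard-library check (all numbers copied from the table; no values are assumed beyond it):

```python
# Cross-check of the efficiency ratios quoted for Table 1.
# (params in K, FLOPs in G, GPU memory in M; values copied from the table)
baseline = {"params": 445, "flops": 44.1, "mem": 115.98}    # Dim=36, Blocks=8
wide     = {"params": 718, "flops": 71.8, "mem": 134.62}    # Dim=48, Blocks=8
largest  = {"params": 1066, "flops": 107.0, "mem": 141.73}  # Dim=48, Blocks=12

def saving(small, big):
    """Relative reduction of the baseline w.r.t. a larger configuration."""
    return (big - small) / big

print(f"vs Dim=48/Blocks=8:  params -{saving(baseline['params'], wide['params']):.1%}, "
      f"FLOPs -{saving(baseline['flops'], wide['flops']):.1%}, "
      f"memory -{saving(baseline['mem'], wide['mem']):.1%}")
print(f"vs Dim=48/Blocks=12: params -{saving(baseline['params'], largest['params']):.1%}, "
      f"FLOPs -{saving(baseline['flops'], largest['flops']):.1%}")
```

The printed reductions (38.0%/38.6%/13.8% against Dim=48, Blocks=8; 58.3%/58.8% against Dim=48, Blocks=12) match the percentages reported in the text.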

As shown in Table 2, the proposed HPEN exhibits competitive performance in both reconstruction quality and computational efficiency. For the ×2 SR task, HPEN reaches a PSNR of 38.28dB on the IR700-test dataset and 39.83dB on the resultsA dataset, outperforming the strong baseline HiT-SR26 by 0.03% and 0.05%, respectively. Its advantage is more pronounced in the more challenging ×4 task, where HPEN attains 40.88dB on the CVC-09 dataset and 34.34dB on the my_test dataset, outperforming HiT-SR26 and CATANet25 by 0.05% and 0.03%, respectively. Notably, these results are achieved at significantly lower computational cost: in the ×4 setting, HPEN requires only 44G FLOPs, compared with 77G for HiT-SR26 and 64G for CATANet25, while still achieving a 0.06% higher PSNR on the IR700-test dataset. Although LIRSRN43 maintains an extremely low parameter count and computational complexity (70K parameters and 5G FLOPs), it suffers an average degradation of 1.04dB in PSNR and 0.0102 in SSIM relative to HPEN, with the PSNR reduction reaching 4.6% on the CVC-09 dataset. Table 2 also shows that, excluding LIRSRN43, PSNR variations among the evaluated methods are confined within 0.35dB, with over 75% of the differences below 0.10dB, which makes the 1.04dB gap particularly notable.
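The average quality gap between HPEN and LIRSRN43 quoted above follows directly from the ×4 rows of Table 2. The following is a standard-library recomputation using only values copied from the table:

```python
# Recomputing the average PSNR/SSIM gap between HPEN and LIRSRN on the
# six ×4 test sets (values copied from the results table, in table order:
# resultsA, resultsC, CVC-09, IR700-test, M3FD, my_test).
hpen   = [(34.93, 0.8640), (35.58, 0.8803), (40.88, 0.9528),
          (32.16, 0.8487), (33.25, 0.8713), (34.34, 0.8768)]
lirsrn = [(34.20, 0.8574), (34.73, 0.8734), (39.01, 0.9454),
          (31.18, 0.8267), (32.41, 0.8592), (33.35, 0.8706)]

psnr_gap = sum(h[0] - l[0] for h, l in zip(hpen, lirsrn)) / len(hpen)
ssim_gap = sum(h[1] - l[1] for h, l in zip(hpen, lirsrn)) / len(hpen)
cvc_rel  = (40.88 - 39.01) / 40.88   # relative PSNR reduction on CVC-09

print(f"avg PSNR gap: {psnr_gap:.2f} dB, avg SSIM gap: {ssim_gap:.4f}, "
      f"CVC-09 relative drop: {cvc_rel:.1%}")
```

The recomputed averages (1.04dB PSNR, 0.0102 SSIM) and the 4.6% relative drop on CVC-09 agree with the figures reported in the text.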

Additionally, we compare the GPU memory usage and inference time of the proposed method against mainstream methods, including RFDN30, ShuffleMixer33, RLKA-Net42, HNCT59, CATANet25 and HiT-SR26. As illustrated in Table 3, the proposed HPEN is highly efficient in terms of memory consumption, requiring only 115.98M, a reduction of approximately 90.58% and 83.54% compared with the most memory-intensive models, HiT-SR26 (1231.27M) and CATANet25 (704.38M), respectively. In terms of inference time, HPEN takes only 44.63ms, about 61.0% faster than CATANet (114.49ms) and 55.84% faster than HiT-SR (101.06ms). Although it ranks as the third fastest method behind RFDN and ShuffleMixer, its inference time remains highly competitive. Together with its leading PSNR/SSIM performance (Table 2), these results demonstrate that HPEN achieves a more favorable performance-efficiency trade-off than existing approaches. A visual summary of GPU memory and inference time across all methods is provided in Figure 1.

Table 3.

A comparison of computational efficiency for different ×4 SR methods. The analysis compares average inference time (#Avg.Time) and GPU memory consumption (#GPU Mem.) across 50 test images with Inline graphic pixel resolution. All measurements were conducted under consistent hardware and software configurations to ensure comparative fairness.

Methods RFDN30 ShuffleMixer33 RLKA-Net42 HNCT59 CATANet25 HiT-SR26 HPEN(Ours)
#GPU Mem. [M] 59.97 201.39 46.12 116.04 704.38 1231.27 115.98
#Avg.Time [ms] 15.11 19.58 53.65 59.14 114.49 101.06 44.63
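The memory and latency comparisons above can be sanity-checked against the raw Table 3 entries; the snippet below uses only values copied from the table:

```python
# Sanity-check of the memory and latency comparisons drawn from Table 3.
mem  = {"HPEN": 115.98, "CATANet": 704.38, "HiT-SR": 1231.27}   # MB
time = {"HPEN": 44.63, "CATANet": 114.49, "HiT-SR": 101.06}     # ms

for rival in ("HiT-SR", "CATANet"):
    mem_saving = 1 - mem["HPEN"] / mem[rival]   # fraction of memory saved
    speedup    = time[rival] / time["HPEN"]     # wall-clock speed ratio
    print(f"vs {rival}: memory -{mem_saving:.2%}, {speedup:.2f}x faster")
```

Against HiT-SR26 this yields a 90.58% memory saving and a roughly 2.26× speedup, consistent with the ~2.3× figure quoted in the abstract; against CATANet25 the memory saving comes out to about 83.5%.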

To further assess the perceptual quality of the reconstructed images, we evaluate all approaches using the learned perceptual image patch similarity (LPIPS)61 metric with a unified, pre-trained AlexNet62 backbone to ensure a fair comparison. As reported in Table 4, the proposed HPEN achieves the lowest LPIPS scores across all test sets, indicating superior perceptual fidelity. Notably, on the CVC-09 dataset, HPEN attains the best score of 0.0296, slightly ahead of HiT-SR26 (0.0299) and clearly ahead of the remaining lightweight approaches. The stacked perceptual-loss comparison in Fig. 3 further confirms that HPEN consistently yields the smallest perceptual deviation. These results demonstrate that HPEN better preserves perceptually important details in infrared image SR, generating outputs that are perceptually closer to the original HR images.

Table 4.

A comparison of LPIPS values across different methods based on the same pre-trained network for ×4 SR results.

Scale SR results Model resultsA resultsC CVC-09 IR700-test M3FD my_test
×4 CARN31 0.1252 0.1187 0.0613 0.1764 0.1509 0.1304
ShuffleMixer_tiny33 0.1142 0.1114 0.0486 0.1625 0.1420 0.1246
SAFMN14 0.1114 0.1109 0.0401 0.1594 0.1389 0.1214
LIRSRN43 0.1235 0.1242 0.0406 0.1691 0.1482 0.1294
HiT-SR26 0.1016 0.1040 0.0299 0.1552 0.1338 0.1176
HPEN(Ours) 0.1015 0.1036 0.0296 0.1547 0.1331 0.1168

Fig. 3.

Fig. 3

Comprehensive comparison of LPIPS values between the proposed method and other lightweight approaches for ×4 SR results across all test datasets.

Qualitative comparisons. Figure 4 presents a visual comparison of ×4 SR results across multiple lightweight methods. The proposed HPEN excels in both detail restoration and structural preservation. Notably, it effectively mitigates the over-smoothing artifacts commonly observed in the outputs of IMDN11, BSRN13, ShuffleMixer33, and SAFMN14, preserving sharper edge contours and more intricate textures. Furthermore, HPEN successfully recovers structural details that are inadequately reconstructed by HNCT59 and SMFANet34, which is particularly evident in the restoration of stripe patterns.

Fig. 4.

Fig. 4

Visual comparisons of different methods on image001 from the resultsA dataset for ×4 SR. The proposed HPEN reconstructs sharper edges and more authentic textures than other approaches.

LAM comparisons. Figure 5 compares the local attribution maps (LAMs)63 and corresponding diffusion indices (DIs)63 for image014 from the resultsC dataset. The proposed HPEN shows notably broader activation coverage and achieves a significantly higher DI of 3.25. Its extensive high-response regions (shown in red) visually reflect a stronger ability to integrate contextual information. This contrasts sharply with the more limited attention range of HiT-SR26 (DI=1.35) and with the restricted information utilization of the other visualized models, all of which attain DI values no greater than 1.00.

Fig. 5.

Fig. 5

A comparison of the local attribution maps (LAMs)63 and corresponding diffusion indices (DIs)63 across different methods. The LAM visualizes each pixel’s contribution from the LR input to the final SR output within the marked region, while the DI quantifies the spatial dispersion of influential pixels, with higher values indicating a more widespread attention distribution. The LAM visualizations and corresponding DI results substantiate that the proposed HPEN effectively utilizes extensive contextual information during the reconstruction process.

Ablation study

To validate the architectural design of our model, we conduct systematic ablation studies by progressively removing its key components. This allows us to assess the necessity of each part and quantify its contribution to overall performance. The quantitative results on six benchmark datasets, presented in Table 5, clearly illustrate the role each module plays in achieving the reported performance gains.

Table 5.

Comparison of ×4 SR performance results for HPEN and its architectural variants across the six test datasets. The FLOPs are measured using input images with Inline graphic pixel resolution. “A → B” stands for replacing A with B, and “A → None” represents removing operation A. “STL” denotes the Swin Transformer layer. The baseline HPEN configuration is highlighted in bold.

Ablation Variant Params[K] FLOPs[G] resultsA resultsC CVC-09 IR700-test M3FD my_test
Baseline HPEN 445 44 34.93/0.8640 35.58/0.8803 40.88/0.9528 32.16/0.8487 33.25/0.8713 34.34/0.8768
HPEB TAB → None 295 32 -0.04/-0.0005 -0.02/-0.0004 -0.09/-0.0005 -0.02/-0.0009 -0.09/-0.0007 -0.05/-0.0007
MFEB → None 260 20 -0.04/-0.0006 -0.03/-0.0006 -0.08/-0.0004 -0.03/-0.0010 -0.06/-0.0004 -0.02/-0.0004
Conv → None 352 37 -0.03/-0.0004 -0.02/-0.0004 -0.04/-0.0002 -0.01/-0.0001 -0.05/-0.0003 -0.03/-0.0003
HPEB → MFEB then TAB 445 44 -0.03/-0.0004 -0.02/-0.0004 -0.08/-0.0002 -0.01/= -0.03/-0.0002 -0.02/-0.0004
HPEB → TAB then TAB 410 33 -0.02/-0.0005 -0.11/-0.0005 -0.06/-0.0002 -0.01/-0.0006 -0.05/-0.0003 -0.01/-0.0004
HPEB → MFEB then MFEB 481 56 -0.04/-0.0006 -0.03/-0.0006 -0.10/-0.0004 -0.04/-0.0012 -0.08/-0.0005 -0.04/-0.0006
TAB LN → None 445 44 -0.15/-0.0019 -0.13/-0.0021 -0.31/-0.0024 -0.08/-0.0018 -0.12/-0.0009 -0.14/-0.0014
ConvFFN → Channel MLP 421 37 -0.03/-0.0005 -0.04/-0.0006 -0.07/= -0.04/-0.0005 -0.03/-0.0001 -0.02/-0.0004
ConvFFN → FFN 437 40 -0.01/-0.0003 -0.02/-0.0003 -0.05/-0.0003 -0.03/-0.0002 -0.02/= -0.02/-0.0003
MFEB DWConv → None 445 44 -0.05/-0.0007 -0.02/-0.0005 -0.10/-0.0004 -0.01/-0.0009 -0.05/-0.0003 -0.03/-0.0005
DilaConv → None 445 44 -0.03/-0.0005 -0.02/-0.0005 -0.09/-0.0004 -0.01/-0.0009 -0.06/-0.0004 -0.02/-0.0006
DWConv then Conv 495 51 -0.05/-0.0008 -0.03/-0.0007 -0.12/-0.0005 -0.03/-0.0011 -0.07/-0.0003 -0.03/-0.0006
DWConv then DWConv 401 41 -0.04/-0.0008 -0.03/-0.0008 -0.10/-0.0005 -0.02/-0.0013 -0.04/-0.0004 -0.03/-0.0007
DWConv then DilaConv 423 42 -0.06/-0.0007 -0.03/-0.0008 -0.10/-0.0004 -0.04/-0.0012 -0.06/-0.0004 -0.03/-0.0005
DilaConv then DWConv 423 42 -0.04/-0.0008 -0.03/-0.0008 -0.10/-0.0005 -0.02/-0.0013 -0.04/-0.0007 -0.03/-0.0007
RTL → None 318 25 -0.03/-0.0005 -0.03/-0.0005 -0.10/-0.0004 -0.03/-0.0011 -0.09/-0.0006 -0.03/-0.0004
RTL → STL 564 52 -0.02/-0.0002 -0.01/-0.0001 -0.06/-0.0003 -0.02/-0.0006 -0.03/-0.0001 -0.02/-0.0001

Effectiveness of hybrid perception enhancement block. To validate the effectiveness of, and the synergy among, the components of the HPEB module, we conduct comprehensive ablation studies. As summarized in Table 5, the full HPEN delivers the best performance in the ×4 SR task, attaining a PSNR of 32.16dB on the IR700-test dataset. Removing the TAB module (“TAB → None”) reduces the parameter count by 33.71% (to 295K) but leads to notable performance degradation across all datasets; for instance, PSNR drops by 0.09dB (0.22%) on the CVC-09 dataset. Ablating either of the remaining components (“MFEB → None” and “Conv → None”) also harms performance. Specifically, removing the MFEB module causes an average PSNR decrease of 0.04dB.

Notably, even when the overall model capacity remains unchanged, altering the processing order to MFEB followed by TAB (“HPEB → MFEB then TAB”) substantially degrades performance. As shown in Table 5, this change reduces PSNR by 0.08dB (0.20%) on the CVC-09 dataset. Replacing HPEB with a dual-TAB structure (“HPEB → TAB then TAB”) saves 7.19% in parameters, but at the cost of a 0.11dB (0.31%) lower PSNR on the resultsC dataset, underscoring the essential role of the MFEB’s multi-scale feature extraction in preserving infrared image structures. Conversely, a dual-MFEB configuration (“HPEB → MFEB then MFEB”) increases parameters by 8.09% without improving any metric; instead, it reduces PSNR on the CVC-09 dataset by 0.10dB. Together, these results validate the importance of the complementary TAB-MFEB design in balancing model efficiency and reconstruction performance for infrared image SR.
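The parameter-change percentages cited for these HPEB ablations follow directly from the “#Params” column of Table 5 (baseline: 445K). A short stdlib check, using only values copied from the table:

```python
# Parameter changes of HPEB ablation variants relative to the 445K baseline
# (values copied from the "#Params" column of Table 5).
BASE = 445  # K parameters

variants = {
    "TAB -> None":    295,  # global branch removed
    "MFEB then MFEB": 481,  # dual-MFEB configuration
}

for name, params in variants.items():
    change = (params - BASE) / BASE
    print(f"{name}: {change:+.2%} parameters")

# Relative PSNR drop on CVC-09 when TAB is removed (0.09 dB off 40.88 dB).
print(f"CVC-09 drop for TAB removal: {0.09 / 40.88:.2%}")
```

This reproduces the quoted -33.71% parameter reduction for TAB removal, the +8.09% increase for the dual-MFEB variant, and the 0.22% relative PSNR drop on CVC-09.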

Token aggregation block. As shown in Table 5, the ablation study confirms the clear superiority of the complete HPEN (Baseline) over all architectural variants. For example, removing LayerNorm (“LN → None”) causes a marked decline in both PSNR and SSIM across datasets (e.g., resultsA drops from 34.93/0.8640 to 34.78/0.8621) without reducing parameters or computation, underscoring its essential role in training stability and representational capacity. Replacing ConvFFN with a channel MLP or a standard FFN reduces parameters by 5.4% and 1.8%, respectively, yet causes notable performance drops (e.g., 0.07dB and 0.05dB lower PSNR on the CVC-09 dataset). These comparisons show that the complete TAB design, which integrates LayerNorm, dynamic token aggregation, and the convolutional feed-forward network (ConvFFN) into a synergistic unit, is essential for achieving optimal reconstruction performance at high efficiency (445K parameters, 44G FLOPs).

Effectiveness of multi-scale feature enhancement block. To validate the effectiveness of the MFEB, we perform systematic ablation experiments. As reported in Table 5, removing the MFEB module (“MFEB → None”) results in an average PSNR degradation of 0.04dB across all test datasets. Furthermore, to examine the influence of different receptive fields, we evaluate several convolutional combinations. Table 5 shows that keeping only the depth-wise convolution (“DilaConv → None”) or only the dilated convolution (“DWConv → None”) leads to an average PSNR decrease of 0.03dB and 0.04dB across all test datasets, respectively. These results confirm that relying on a single type of feature extraction substantially degrades model performance.
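The complementary receptive fields of the two branches can be made concrete with the standard formula for dilated convolutions: a k×k kernel with dilation d spans k + (k-1)(d-1) input positions per axis. The kernel sizes below are illustrative assumptions, not specifications taken from the paper:

```python
# Effective receptive field (per axis) of a dilated convolution:
# a k x k kernel with dilation d covers k + (k - 1) * (d - 1) positions.
# The 3x3 kernels here are illustrative assumptions, not from the paper.
def receptive_field(kernel: int, dilation: int = 1) -> int:
    return kernel + (kernel - 1) * (dilation - 1)

print(receptive_field(3, 1))  # plain / depth-wise 3x3 -> 3
print(receptive_field(3, 2))  # dilated 3x3, d=2      -> 5
print(receptive_field(3, 3))  # dilated 3x3, d=3      -> 7
```

This illustrates why pairing a depth-wise branch (small receptive field, fine detail) with a dilated branch (enlarged receptive field at the same parameter cost) yields multi-scale coverage, consistent with the ablation finding that removing either branch hurts performance.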

To further assess the efficiency of the MFEB’s dual-branch architecture, we conduct extensive comparative experiments. The results show that the DWConv-Conv cascade (“DWConv then Conv”) increases the parameter count by 11.24% yet consistently degrades performance across all datasets, reducing PSNR by 0.05dB (0.14%) on the resultsA dataset. Other cascade configurations deteriorate similarly; in particular, the DilaConv-DWConv structure (“DilaConv then DWConv”) causes an average PSNR drop of 0.05dB across all evaluation datasets. These findings confirm that the carefully designed parallel branches of the MFEB extract features more efficiently than sequential alternatives.

Furthermore, we introduce the RTL module, which enables cross-region feature interaction through spatial attention, thereby effectively modeling long-range dependencies in infrared images. To evaluate its contribution, we perform an ablation study that removes the RTL module (“RTL → None”). This modification reduces model parameters by 28.5% relative to the baseline HPEN, but at the cost of performance: the average PSNR and SSIM across the test datasets decrease by 0.044dB and 0.0006, respectively. Moreover, although replacing RTL with the STL (“RTL → STL”) increases both the parameter count and the computational load, its performance remains almost at the same level as “RTL → None”, indicating that the STL cannot effectively substitute for the lightweight RTL module in this architecture. In contrast, the Baseline, with only 44G FLOPs and 445K parameters, still achieves a PSNR of 40.88dB and an SSIM of 0.9528 on the CVC-09 dataset. These results indicate that, although the RTL module increases model complexity, it is essential for maintaining high reconstruction quality in infrared SR tasks.

Conclusion

In this paper, we propose a lightweight hybrid perception enhancement network (HPEN) for infrared image super-resolution, whose core innovation is a novel hybrid perception enhancement block (HPEB). The HPEB systematically integrates three complementary components: a token aggregation block (TAB) for modeling long-range dependencies through dynamic feature reorganization, a multi-scale feature enhancement block (MFEB) for capturing fine-grained details via parallel depth-wise and dilated convolutions, and a convolutional layer for feature refinement. This synergistic design enables comprehensive feature learning while maintaining computational efficiency. Extensive experiments show that HPEN achieves state-of-the-art performance among lightweight SR methods, effectively balancing reconstruction quality with practical computational demands. This compelling performance-efficiency trade-off makes HPEN suitable for real-world applications where both accuracy and resource constraints are critical, such as enhancing low-resolution infrared footage in surveillance systems, recovering fine thermal textures for industrial inspection and predictive maintenance, and assisting higher-resolution analysis in medical thermography.

While the proposed HPEN achieves strong performance in infrared image SR, it is important to acknowledge its current limitations. The model is trained exclusively on data synthesized via bicubic downsampling, which may not fully capture the complex and varied degradation patterns (e.g., sensor-specific noise, motion blur, atmospheric effects) present in real-world infrared imaging; its generalization to such authentic conditions therefore requires further validation. Additionally, although the model is designed to be efficient, its computational and memory requirements could still be challenging for deployment on extremely resource-constrained edge devices with stringent power and latency budgets. Furthermore, this work focuses on scaling factors of ×2 and ×4; effectiveness on more extreme SR tasks (e.g., ×8 and above) remains unexplored and presents a direction for future research.

Acknowledgements

This research was funded by Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01C461, 2022D01C460), “Tianshan Talents” Famous Teachers in Education and Teaching project of Xinjiang Uygur Autonomous Region (2025), and the “Tianchi Talents Attraction Project” of Xinjiang Uygur Autonomous Region (2024TCLJ04).

Author contributions

Z.L. main contribution was to propose the methodology of this work and write the paper. J.T. guided the entire research process, participated in writing the paper and secured funding for the research. C.L. participated in the implementation of the algorithm. G.Z. was responsible for algorithm validation and data analysis. All authors have read and agreed to the published version of the manuscript.

Data availability

The datasets used in this study are publicly available. The IR700 dataset can be accessed from the corresponding article. Additional test datasets (results-A and results-C, CVC09-100, IR700-test, M3FD) are also publicly available via their respective cited sources. The implementation code for HPEN is now available at https://github.com/smilenorth1/HPEN-main. Materials availability: The datasets used or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally to this work: Zepeng Liu and Jiya Tian.

Contributor Information

Zepeng Liu, Email: smilenorth089@163.com.

Guodong Zhang, Email: 2023021@xjit.edu.cn.

References

  • 1.Zhan, Y., Hou, H., Sheng, G., Zhang, Y. & Jiang, X. Infrared image segmentation and temperature monitoring based on yolo model object detection results. In 2024 10th International Conference on Condition Monitoring and Diagnosis (CMD), 456–460, 10.23919/CMD62064.2024.10766200 (2024).
  • 2.Tuerniyazi, A., Lan, J., Zeng, Y., Hu, J. & Zhuo, Y. Multiview angle uav infrared image simulation with segmented model and object detection for traffic surveillance. Sci. Reports15, 1–18. 10.1038/s41598-025-89585-x (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Wang, S., Du, Y., Zhao, S. & Gan, L. Multi-scale infrared military target detection based on 3x-fpn feature fusion network. IEEE Access11, 141585–141597. 10.1109/ACCESS.2023.3343419 (2023). [Google Scholar]
  • 4.Cui, H., Xu, Y., Zeng, J. & Tang, Z. The methods in infrared thermal imaging diagnosis technology of power equipment. In 2013 IEEE 4th International Conference on Electronics Information and Emergency Communication, 246–251, 10.1109/ICEIEC.2013.6835498 (2013).
  • 5.Wang, Q., Jin, P., Wu, Y., Zhou, L. & Shen, T. Infrared image enhancement: A review. IEEE J. Sel. Top. Appl. Earth Obs.Remote. Sens.18, 3281–3299. 10.1109/JSTARS.2024.3523418 (2025). [Google Scholar]
  • 6.Florio, C. et al. The potential of near infrared (nir) spectroscopy coupled to principal component analysis (pca) for product and tanning process control of innovative leathers. Sci. Reports15, 1–12. 10.1038/s41598-025-17598-7 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Dong, C., Loy, C. C., He, K. & Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 184–199, 10.1007/978-3-319-10593-2_13 (2014).
  • 8.Dong, C., Loy, C. C. & Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 391–407, 10.1007/978-3-319-46475-6_25 (2016).
  • 9.Kim, J., Lee, J. K. & Lee, K. M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 1637–1645, 10.1109/CVPR.2016.181 (2016).
  • 10.Kim, J., Lee, J. K. & Lee, K. M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the conference on Computer Vision and Pattern Recognition, 1646–1654, 10.1109/CVPR.2016.182 (2016).
  • 11.Hui, Z., Gao, X., Yang, Y. & Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th acm international conference on Multimedia, 2024–2032, 10.1145/3343031.3351084 (2019).
  • 12.Ledig, C. et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the conference on Computer Vision and Pattern Recognition, 105–114, 10.1109/CVPR.2017.19 (2017).
  • 13.Li, Z. et al. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 833–843, 10.1109/CVPRW56347.2022.00099 (2022).
  • 14.Sun, L., Dong, J., Tang, J. & Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 13190–13199, 10.1109/ICCV51070.2023.01213 (2023).
  • 15.Shi, W., Caballero, J., Huszár, F., Totz, J. & Aitken, A. P. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 1874–1883, 10.1109/CVPR.2016.207 (2016).
  • 16.Zhang, Y. et al. Image super-resolution using very deep residual channel attention networks. In Proceedings of the IEEE/CVF on European Conference on Computer Vision, 286–301, 10.1007/978-3-030-01234-2_18 (2018).
  • 17.Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the conference on International Conference on Learning Representations, 10.48550/arXiv.2010.11929 (2021).
  • 18.Chen, H. et al. Pre-trained image processing transformer. In Proceedings of the conference on Computer Vision and Pattern Recognition, 12299–12310, 10.1109/CVPR46437.2021.01212 (2021).
  • 19.Zhou, Y. et al. Srformer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 12734–12745, 10.1109/ICCV51070.2023.01174 (2023).
  • 20.Liang, J. et al. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1833–1844, 10.1109/ICCVW54120.2021.00210 (2021).
  • 21.Zamir, S. W. et al. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 5728–5739, 10.48550/arXiv.2111.09881 (2022).
  • 22.Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 5998–6008, 10.48550/arXiv.1706.03762 (2017).
  • 23.Park, N. & Kim, S. How do vision transformers work? In Proceedings of the IEEE/CVF International Conference on Learning Representations, 10.48550/arXiv.2202.06709 (2022).
  • 24.Park, N. & Kim, S. How do vision transformers work? In Proceedings of the International Conference on Learning Representations, 10.48550/arXiv.2202.06709 (2022).
  • 25.Liu, X., Liu, J., Tang, J. & Wu, G. Catanet: Efficient content-aware token aggregation for lightweight image super-resolution. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 10.48550/arXiv.2503.06896 (2025).
  • 26.Zhang, X., Zhang, Y. & Yu, F. Hit-sr: Hierarchical transformer for efficient image super-resolution. In Proceedings of the conference on European Conference on Computer Vision, 483–500, 10.48550/arXiv.2205.04437 (2024).
  • 27.He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the conference on Computer Vision and Pattern Recognition, 770–778, 10.1109/CVPR.2016.90 (2016).
  • 28.Tai, Y., Yang, J. & Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 3147–3155, 10.1109/CVPR.2017.298 (2017).
  • 29.Lim, B., Son, S., Kim, H., Nah, S. & Lee, K. M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the conference on Computer Vision and Pattern Recognition Workshops, 136–144, 10.1109/CVPRW.2017.151 (2017).
  • 30.Liu, J., Tang, J. & Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision workshops, 41–55, 10.48550/arXiv.2009.11551 (2020).
  • 31.Ahn, N., Kang, B. & Sohn, K.-A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 252–268, 10.48550/arXiv.1803.08664 (2018).
  • 32.Hui, Z., Wang, X. & Gao, X. Fast and accurate single image super-resolution via information distillation network. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 723–731, 10.1109/CVPR.2018.00082 (2018).
  • 33.Sun, L., Pan, J. & Tang, J. Shufflemixer: An efficient convnet for image super-resolution. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 17314–17326, 10.48550/arXiv.2205.15175 (2022).
  • 34.Zheng, M., Sun, L., Dong, J. & Pan, J. Smfanet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the IEEE/CVF European Conference on Computer Vision, 359–375, 10.1007/978-3-031-72973-7_21 (2025).
  • 35.Chen, X., Wang, X., Zhou, J., Qiao, Y. & Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 22367–22377, 10.48550/arXiv.2205.04437 (2023).
  • 36.Chudasama, V. et al. Therisurnet-a computationally efficient thermal image super-resolution network. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 388–397, 10.1109/CVPRW50498.2020.00051 (2020).
  • 37.Huang, Y., Jiang, Z., Lan, R., Zhang, S. & Pi, K. Infrared image super-resolution via transfer learning and PSRGAN. IEEE Signal Process. Lett. 28, 982–986, 10.1109/LSP.2021.3077801 (2021).
  • 38.Marivani, I., Tsiligianni, E., Cornelis, B. & Deligiannis, N. Multimodal deep unfolding for guided image super-resolution. IEEE Trans. Image Process. 29, 8443–8456, 10.1109/TIP.2020.3014729 (2020).
  • 39.Ying, X. et al. Local motion and contrast priors driven deep network for infrared small target super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 15, 5480–5495, 10.1109/JSTARS.2022.3183230 (2022).
  • 40.Liu, S. et al. Infrared image super-resolution via lightweight information split network. arXiv preprint, 10.48550/arXiv.2405.10561 (2024).
  • 41.Qin, F. et al. LKFormer: Large kernel transformer for infrared image super-resolution. Multimed. Tools Appl. 83, 72063–72077, 10.1007/s11042-024-18409-3 (2024).
  • 42.Liu, G., Zhou, S., Chen, X., Yue, W. & Ke, J. Recurrent large kernel attention network for efficient single infrared image super-resolution. IEEE Access 12, 923–935, 10.1109/ACCESS.2023.3344830 (2024).
  • 43.Lin, C.-A., Liu, T.-J. & Liu, K.-H. LIRSRN: A lightweight infrared image super-resolution network. In 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 1–5, 10.1109/ISCAS58744.2024.10558676 (2024).
  • 44.Juhartini, Dwinita, A. & Desmiwati. Single shot multibox detector (SSD) in object detection: A review. Int. J. Adv. Comput. Informatics 1, 118–127, 10.71129/ijaci.v1i2.pp118-127 (2025).
  • 45.Ricki, S., Dicky, H. & Bahalwan, A. Fast region-based convolutional neural network in object detection: A review. Int. J. Adv. Comput. Informatics 2, 34–40, 10.71129/ijaci.v2i1 (2025).
  • 46.Vivi, A. & Surni, E. YOLOv8 for object detection: A comprehensive review of advances, techniques, and applications. Int. J. Adv. Comput. Informatics 2, 53–61, 10.71129/ijaci.v2i1 (2025).
  • 47.Wang, Y., Li, Y., Wang, G. & Liu, X. Multi-scale attention network for single image super-resolution. J. Vis. Commun. Image Represent. 80, 103300, 10.1016/j.jvcir.2021.103300 (2021).
  • 48.Wanda, I. & Abubakar, A. A comprehensive review of ConvNeXt architecture in image classification: Performance, applications, and prospects. Int. J. Adv. Comput. Informatics 2, 108–114, 10.71129/ijaci.v2i2 (2025).
  • 49.Surni, E., Vivi, A. & Bahtiar, I. Mask region-based convolutional neural network in object detection: A review. Int. J. Adv. Comput. Informatics 1, 106–117, 10.71129/ijaci.v1i2.pp106-117 (2025).
  • 50.Ch, R., Shaik, J., Srikavya, R., Sahu, M. & Sahu, A. K. A novel fiestal structured chromatic series-based data security approach. Discov. Internet Things 5, 106, 10.1007/s43926-025-00162-0 (2025).
  • 51.Sahu, A. et al. Dual image-based reversible fragile watermarking scheme for tamper detection and localization. Pattern Anal. Appl. 26, 571–590 (2023).
  • 52.Zou, Y. et al. Super-resolution reconstruction of infrared images based on a convolutional neural network with skip connections. Opt. Lasers Eng. 146, 106717, 10.1016/j.optlaseng.2021.106717 (2021).
  • 53.Liu, Y., Chen, X., Cheng, J., Peng, H. & Wang, Z. Infrared and visible image fusion with convolutional neural networks. Int. J. Wavelets Multiresolut. Inf. Process. 16, 1850018, 10.1142/S0219691318500182 (2018).
  • 54.Bai, Y. et al. IBFusion: An infrared and visible image fusion method based on infrared target mask and bimodal feature extraction strategy. IEEE Trans. Multimed. 26, 10610–10622, 10.1109/TMM.2024.3410113 (2024).
  • 55.Liu, J. et al. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5802–5811, 10.48550/arXiv.2203.16220 (2022).
  • 56.Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 10.48550/arXiv.1412.6980 (2015).
  • 57.Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 10.48550/arXiv.1608.03983 (2017).
  • 58.Wang, L. et al. Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4915–4924, 10.1109/CVPR46437.2021.00488 (2021).
  • 59.Fang, J., Lin, H., Chen, X. & Zeng, K. A hybrid network of CNN and transformer for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1102–1111, 10.1109/CVPRW56347.2022.00119 (2022).
  • 60.Wu, G., Jiang, J., Jiang, K. & Liu, X. Fully 1×1 convolutional network for lightweight image super-resolution. Mach. Intell. Res. 21, 1–15, 10.1007/s11633-024-1501-9 (2025).
  • 61.Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 586–595, 10.48550/arXiv.1801.03924 (2018).
  • 62.Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 10.1145/3065386 (2012).
  • 63.Gu, J. & Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9199–9208, 10.48550/arXiv.2011.11036 (2021).

Associated Data
Data Availability Statement

The datasets used in this study are publicly available. The IR700 dataset can be accessed from the corresponding article. The additional test datasets (results-A and results-C, CVC09-100, IR700-test, M3FD) are also publicly available via their respective cited sources. The implementation code for HPEN is available at https://github.com/smilenorth1/HPEN-main. Materials availability: the datasets used or analysed during the current study are available from the corresponding author on reasonable request.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group