A lightweight multi-scale detection framework for X-ray images with supervised contrastive learning

Qi Diao; WengHowe Chan; Azlan Mohd Zain; Kohbalan Moorthy; YiFan Chen; Huan Tong; YuJie Huo

doi:10.1038/s41598-026-38000-0

2026 Feb 25;16:8635. doi: 10.1038/s41598-026-38000-0

PMCID: PMC12979694 PMID: 41735370

Abstract

Automated X-ray security inspection requires object detectors to accurately and efficiently identify prohibited items under challenging conditions, such as dense occlusion, cluttered backgrounds, and limited computational resources. However, existing detectors often struggle with small, overlapping, or heavily occluded objects, which severely limits their real-world applicability. To address these challenges, we propose YOLOv8-DWConv-CSAF, a lightweight and discriminative multi-scale object detection framework specifically designed for X-ray imagery. Our method integrates architectural compression, attention-guided feature enhancement, and contrastive representation learning into a unified detection pipeline. Specifically, all standard convolutions in the YOLOv8 backbone are replaced with depthwise separable convolutions (DWConv), significantly reducing the number of parameters and computational cost while preserving representational capacity. To further enhance detection accuracy, we introduce a novel Channel-Spatial Attention Fusion (CSAF) module that synergistically combines SE and CBAM mechanisms to enrich both channel-wise and spatial feature representations. Additionally, stacked C2f modules facilitate effective multi-scale feature aggregation. To improve localization and class discriminability, we adopt a hybrid loss function that combines Pixel Intersection-over-Union (PIoU) for precise bounding box regression with a supervised InfoNCE contrastive loss to promote intra-class compactness and inter-class separation in the learned feature space. Comprehensive evaluations on both the CLCXray and HiXray benchmarks show that our model delivers state-of-the-art performance, combining high accuracy with real-time efficiency and robust generalization, which makes it well suited for deployment in practical security screening systems.

Keywords: Object detection, DWConv, Attention mechanism, YOLOv8

Subject terms: Engineering, Mathematics and computing

Introduction

X-ray security inspection plays a critical role in ensuring public safety and is extensively deployed in airports, transportation hubs, and key infrastructure to prevent the illegal transportation of dangerous or prohibited items^1,2. The effectiveness of such systems hinges on automated and accurate object detection under complex imaging conditions. Manual inspection is labor-intensive and error-prone, especially in high-throughput environments, thus highlighting the urgent need for fast, reliable, and intelligent vision-based solutions powered by deep learning³.

In recent years, deep learning-based object detection methods have shown remarkable progress in the automated identification of threats in X-ray images. Convolutional Neural Networks (CNNs)⁴ and one-stage detectors such as the YOLO series^5,6 have achieved promising results in terms of both accuracy and inference speed. However, compared to natural images, X-ray imagery poses distinct challenges due to complex object layouts, low color contrast, heavy occlusion, and frequent object overlap⁷. As illustrated in Fig. 1, which presents samples from the sixray_yolov8 dataset⁸ (Roboflow, CC BY 4.0), these factors significantly hinder object localization and classification, leading to increased false alarms and missed detections⁹. Furthermore, prohibited items are often small in size and exhibit subtle inter-class visual differences, compounding the recognition difficulty.

Fig. 1 — Examples of typical prohibited items in X-ray images, taken from the *sixray_yolov8* dataset⁸.

In addition, real-world deployment environments impose strict constraints on detection models. It is essential to design lightweight and computationally efficient architectures that support real-time inference on edge devices^9–11. However, many existing models exhibit high computational complexity and large parameter counts, rendering them unsuitable for resource-constrained settings. More critically, their performance degrades significantly when detecting small-scale or heavily occluded objects, thereby limiting their practical applicability^12,13.

Although state-of-the-art frameworks such as Faster R-CNN¹⁴ and the YOLO series (from YOLOv2 to YOLOv12)^5,6 have made notable advancements in detection performance, they still face limitations in the context of X-ray imagery. Many models contain redundant parameters and suffer from inefficiencies that hinder real-time processing. More importantly, they often struggle with precise detection under dense occlusion and clutter. While supervised contrastive learning has recently emerged as a promising approach to improve feature discriminability, its integration into lightweight and real-time detection frameworks for security inspection remains limited and underexplored¹⁵.

To overcome the above challenges, we propose YOLOv8-DWConv-CSAF, a novel, lightweight, and robust multi-scale object detection framework tailored for X-ray security inspection. Our main contributions are summarized as follows:

Lightweight backbone design: We replace all standard convolutions in the YOLOv8 backbone with depthwise separable convolutions (DWConv), significantly reducing the model size and computational overhead while retaining strong feature extraction capability.
Enhanced multi-scale feature fusion via CSAF: A newly designed Channel-Spatial Attention Fusion (CSAF) module is integrated with stacked C2f blocks, combining the strengths of SE and CBAM attention mechanisms. This facilitates more effective fusion of multi-scale features, particularly improving detection of small and occluded objects.
Joint optimization with hybrid loss: The detection head is trained using a multi-task objective that combines Pixel Intersection-over-Union (PIoU) loss for accurate localization with a supervised InfoNCE contrastive loss, which leverages hard positive and negative mining to improve feature discriminability.
Real-world deployment readiness: Extensive experiments on public X-ray datasets demonstrate that YOLOv8-DWConv-CSAF achieves superior detection accuracy, lower latency, and enhanced robustness compared to baseline models, making it highly suitable for deployment in practical, resource-constrained security screening scenarios.

Related work

Recent advances in deep learning have significantly accelerated research in both general object detection and X-ray security inspection. In particular, the pursuit of lightweight and efficient detectors has driven innovation in convolutional design, attention mechanisms, and loss functions. This section reviews representative progress in these domains and discusses their relevance to our proposed framework.

Lightweight convolutional architectures

Depthwise separable convolution (DWConv), first introduced in Xception¹⁶, has become a fundamental building block in the design of lightweight CNNs due to its ability to substantially reduce model parameters and computational complexity while preserving representation capacity. Building upon this, Zhang et al.¹⁷ extended DWConv to Vision Transformers, proposing a plug-and-play module that bypasses traditional self-attention layers, thereby accelerating convergence and enhancing fine-grained recognition under limited data and heavy occlusion. YOLO-based models have widely adopted DWConv to improve detection efficiency. For example, Faster-YOLO-AP¹⁸ introduced a parallel DWConv (PDWConv) module to enhance multi-scale feature extraction with fewer parameters, achieving improved performance in orchard detection and edge deployment. Similarly, YOLO-SDL¹⁹ employed a ShuffleNetV2 backbone augmented with DWConv and large separable kernel attention (LSKA), outperforming several YOLO variants on dense agricultural datasets in both accuracy and speed. These developments demonstrate the effectiveness of DWConv in balancing accuracy and efficiency for real-time detection tasks.

Attention mechanisms in detection

Attention mechanisms are increasingly used to improve the selectivity and expressiveness of deep networks. Channel-wise and spatial-wise attention are particularly impactful. The Squeeze-and-Excitation (SE) module²⁰ models inter-channel dependencies via global average pooling and excitation, enabling adaptive recalibration of feature maps. CBAM²¹ complements this with spatial attention, allowing detectors to focus on informative regions through convolutional spatial masks. Several works have integrated attention into YOLO frameworks with promising results. YOLO-RACE²² incorporated a residual CBAM (ResCBAM) module into YOLOv8, significantly enhancing small-object detection on the VisDrone and SKU-110K datasets by combining attention with residual refinement. In another study, Li et al.²³ evaluated various SE-based modules within YOLOv8 for pipeline crack detection, showing that SE modules improved mAP@0.5:0.95 without compromising real-time inference capabilities. Beyond these, DANet²⁴ introduced a dual-attention mechanism combining spatial and channel attention to enrich contextual feature modeling in detection and segmentation tasks. Inspired by these methods, we propose a unified Channel-Spatial Attention Fusion (CSAF) module that integrates SE and CBAM to jointly enhance multi-scale features in a lightweight fashion, tailored for complex X-ray imagery.

Bounding box regression losses

Accurate object localization is critical in detection tasks. Traditional loss functions such as SmoothL1 and GIoU are limited in their sensitivity to object orientation and fine-grained spatial alignment. To address these limitations, PIoU loss²⁵ introduces a pixel-level IoU formulation that jointly optimizes for orientation and pixel-wise overlap, showing superior performance in oriented object detection on the Retail50K benchmark. Although PIoU has shown promise, its application in X-ray scenarios, which are often characterized by rotated, cluttered, and partially visible targets, remains underexplored. Guan et al.³ conducted a comparative analysis of IoU-based loss variants (e.g., CIoU, DIoU, EIoU, SIoU) under the YOLOv8 framework and demonstrated that PIoU consistently achieved the highest mAP@0.5 on public X-ray datasets. Motivated by these findings, we adopt PIoU as the primary regression loss in our framework to enhance localization precision under complex conditions.

Contrastive learning in object detection

Contrastive learning has emerged as a powerful tool to enhance representation quality by promoting inter-class separation and intra-class compactness in the feature space. InfoNCE²⁶ and its supervised variant SupCon²⁷ have been widely applied in image classification and self-supervised learning. Their application in object detection, however, is relatively recent.

Several works have explored incorporating supervised contrastive loss into detection frameworks to improve robustness to occlusion, class imbalance, and noisy annotations^28–30. Despite their success in natural image domains, few studies have extended contrastive learning to the X-ray inspection domain, where visual similarities among prohibited items and heavy object overlap present unique challenges.

To bridge this gap, we introduce a supervised InfoNCE loss branch into our detection framework. Leveraging hard positive and hard negative mining, this module encourages more discriminative feature embedding, thereby improving classification accuracy and robustness under the challenging conditions typical of X-ray security imagery.

Methodology

YOLOv8, developed by Ultralytics, is a recent release in the YOLO series and a state-of-the-art single-stage object detection model³¹. Its network structure is shown in Fig. 2. It inherits the hallmark efficiency and real-time performance of previous YOLO versions while introducing significant enhancements in network architecture, training strategies, and overall detection performance. Specifically, YOLOv8 adopts an anchor-free detection head, dynamic label assignment, and advanced data augmentation techniques, achieving high detection accuracy alongside substantial improvements in inference speed. Moreover, YOLOv8 offers multiple scalable variants, ranging from YOLOv8n to YOLOv8x, which makes it adaptable to a wide range of computational resource constraints across diverse application scenarios.

Its backbone employs an improved CSPDarknet architecture, featuring cross-stage partial connections that effectively mitigate gradient vanishing and computational redundancy, thereby enhancing feature extraction efficiency. The feature fusion module utilizes a modified Path Aggregation Network (PANet) to efficiently integrate low-level spatial details with high-level semantic information. Additionally, YOLOv8 introduces a decoupled detection head that separates the classification and bounding box regression branches, reducing task interference and boosting detection accuracy. The incorporation of a task-aligned label assignment strategy (Task-Aligned Assigner) and the distribution focal loss further optimizes positive sample selection and bounding box regression performance.

Thanks to its highly modular structure, YOLOv8 allows flexible integration of additional components such as attention mechanisms and lightweight convolution modules, facilitating extensive customization and extension. Consequently, YOLOv8 has found wide application in various computer vision domains³², including natural scene understanding, medical image analysis, remote sensing interpretation, and security inspection tasks, making it an ideal foundational model for our YOLOv8-DWConv-CSAF framework.

Fig. 2 — Architecture of the YOLOv8 network.

Overall architecture overview

As shown in Fig. 3, the proposed framework adopts YOLOv8 as its base, with significant enhancements to the backbone, feature fusion neck, and detection head. Key improvements include the replacement of all standard convolutions with depthwise separable convolutions (DWConv), the introduction of the Channel-Spatial Attention Fusion (CSAF) module, and the integration of multi-scale C2f modules. These innovations collectively enhance feature extraction, discriminability, and computational efficiency.

The multi-scale feature maps extracted by YOLOv8 are fused through stacked C2f modules and the CSAF module. The fused features strengthen the model’s ability to detect small and occluded objects under challenging conditions.

The final fused feature map also provides the embeddings on which the supervised InfoNCE contrastive loss operates, improving the discriminability of the learned features.

Architecture innovation and improvement principles

To construct a lightweight yet robust detector tailored for X-ray security inspection, we systematically redesign the YOLOv8.0n backbone and neck by introducing architectural enhancements that emphasize efficiency, multi-scale feature fusion, and attention-guided representation learning. The principal improvements are outlined as follows:

Lightweight backbone with DWConv: All convolutional layers in the backbone, including both downsampling and feature transformation stages, are replaced with depthwise separable convolutions (DWConv), resulting in substantial reductions in parameters and FLOPs. Specifically, layers 0, 1, 3, 5, and 7 utilize DWConv for spatial downsampling while preserving important fine-grained information, which is critical for detecting small and occluded items in X-ray images.
CSAF modules for cross-scale attention: A novel Channel-Spatial Attention Fusion (CSAF) module, inspired by SE and CBAM, is introduced to jointly model channel-wise and spatial dependencies. This attention unit, implemented as a custom NewAttention block in layers 9 and 16, concatenates two parallel attention branches and fuses them with a $1 \times 1$ convolution, improving the discriminability of features across scales.
Enhanced C2f-based PANet structure: The Path Aggregation Network (PANet) is enhanced with deeper C2f modules at both top-down and bottom-up stages (layers 15, 18, 21, 24, and 27), enabling hierarchical feature integration across P3, P4, and P5. This design fosters robust detection of low-contrast or cluttered targets.
Attention-guided decoding path: To refine the spatial focus during upsampling, CSAF modules are inserted immediately after upsampling operations, helping to suppress background interference and highlight object boundaries, which is crucial in complex X-ray imagery.
Anchor-free multi-scale detection head: While retaining the anchor-free design of YOLOv8, the detection heads are adjusted to harmonize with the refined feature fusion hierarchy, promoting consistent accuracy across multiple resolutions.

Summary of advantages

The proposed YOLOv8-DWConv-CSAF architecture yields the following benefits:

Reduced complexity: Replacing standard convolutions with DWConv achieves a 25%–40% reduction in FLOPs, enhancing real-time viability.
Improved accuracy: Attention-guided multi-scale fusion significantly boosts detection performance, particularly for small, overlapping, and occluded targets.
Deployment readiness: The compact design ensures suitability for edge devices such as NVIDIA Jetson and embedded GPUs.

An overview of the improved architecture is illustrated in Fig. 3, highlighting the key design components.

Lightweight backbone based on depthwise separable convolutions

To reduce computational cost and model size without sacrificing detection performance, we replace all standard convolutional layers in the YOLOv8 backbone with depthwise separable convolutions (DWConv). This design enables a lightweight yet highly expressive feature extractor, suitable for deployment in real-time X-ray security inspection systems.

Standard convolution in X-ray image feature extraction

Standard convolution is a fundamental building block in convolutional neural networks (CNNs) and plays a critical role in learning hierarchical representations from visual data. In the context of X-ray security screening, where the goal is to identify small and potentially occluded prohibited items (e.g., knives, guns, or batteries), effective feature extraction becomes essential.

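Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, a standard convolutional layer produces an output by convolving $N$ learnable kernels of size $D \times D \times C$, where $C$ is the number of input channels, $D$ is the kernel spatial dimension, and $N$ is the number of output channels. The total number of parameters is:

$$\mathrm{Params}_{\mathrm{std}} = D \times D \times C \times N$$

The corresponding number of floating-point operations (FLOPs) needed to generate the output over spatial dimensions $H \times W$ is:

$$\mathrm{FLOPs}_{\mathrm{std}} = H \times W \times D \times D \times C \times N$$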

Standard convolution fuses spatial and cross-channel information simultaneously, enabling rich representation learning. However, in X-ray image domains, it exhibits the following drawbacks:

High computational cost: Due to the large number of parameters and FLOPs, standard convolution layers require significant memory and processing time, which limits real-time applicability in edge-deployed inspection systems.
Redundant feature extraction: The fixed-size kernel operates densely across all input channels, often leading to overparameterization and suboptimal efficiency, particularly for shallow layers.
Deployment limitations: The high complexity makes standard convolution unsuitable for embedded and mobile platforms typically used in portable or embedded X-ray scanners.

These issues motivate the search for more efficient alternatives that retain discriminative power while minimizing computational overhead.

Standard convolution forms the backbone of modern convolutional neural networks (CNNs) and has demonstrated strong feature extraction capabilities in natural image recognition tasks. As illustrated in Fig. 4, each output channel is obtained by convolving all input channels with a shared $D \times D$ kernel and summing the results. However, when applied to X-ray security inspection imagery, this operation exhibits several inherent limitations that hinder its effectiveness and deployability.

Fig. 4 — Illustration of standard convolution. Each output channel is computed by summing the convolutions over all input channels using shared kernels.

In X-ray images, prohibited items such as knives, guns, or batteries are often small, overlapping, and heavily occluded within cluttered baggage backgrounds. Accurate detection in such conditions demands high-resolution feature preservation and multi-scale spatial reasoning. Unfortunately, standard convolutional layers—while expressive—are not computationally efficient enough for real-time deployment in edge-based security systems.

A typical standard convolution layer with $C$ input channels, $N$ output channels, and a $D \times D$ kernel introduces $D \times D \times C \times N$ parameters. It requires $H \times W \times D \times D \times C \times N$ floating-point operations to process a feature map of size $H \times W$, resulting in:

High latency and power consumption: Real-time X-ray systems demand fast processing for high-throughput screening. Standard convolutions are computationally expensive, causing bottlenecks.
Redundant feature learning: Overparameterization can lead to duplication in local feature extraction, especially in early network layers, reducing parameter efficiency.
Incompatibility with low-power devices: High model complexity limits feasibility on embedded platforms typically used in security checkpoints.

Motivation for depthwise separable convolutions

To address the above limitations, we adopt depthwise separable convolution (DWConv), a lightweight alternative that factorizes standard convolution into two simpler operations:

Depthwise convolution: Applies a spatial filter independently to each input channel.
Pointwise convolution: Uses $1 \times 1$ convolutions to combine feature maps across channels.

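Given an input tensor $X \in \mathbb{R}^{H \times W \times C}$, the output of a depthwise separable convolution is computed as:

$$Y = \mathrm{Conv}_{\mathrm{pw}}^{1 \times 1}\big(\mathrm{Conv}_{\mathrm{dw}}^{D \times D}(X)\big)$$

The number of parameters and FLOPs are reduced as follows:

$$\mathrm{Params}_{\mathrm{DW}} = D \times D \times C + C \times N$$

$$\mathrm{FLOPs}_{\mathrm{DW}} = H \times W \times (D \times D \times C + C \times N)$$

Compared to standard convolution:

$$\frac{\mathrm{FLOPs}_{\mathrm{DW}}}{\mathrm{FLOPs}_{\mathrm{std}}} = \frac{D \times D \times C + C \times N}{D \times D \times C \times N} = \frac{1}{N} + \frac{1}{D^2}$$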

This significant reduction in computation makes DWConv ideal for lightweight models on resource-constrained platforms.
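As a minimal sketch of the cost comparison above, the following Python snippet counts parameters and FLOPs for a single layer under both designs and checks the result against the closed-form ratio $1/N + 1/D^2$. The layer dimensions (an 80 × 80 feature map, 64 input channels, 128 output channels, a 3 × 3 kernel) are hypothetical values chosen for illustration and are not taken from the proposed network; biases are ignored and one multiply-accumulate is counted as one operation.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """Parameters and FLOPs of a standard k x k convolution (bias ignored)."""
    params = k * k * c_in * c_out
    flops = h * w * params
    return params, flops


def dwconv_cost(h, w, c_in, c_out, k):
    """Cost of a depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    params = k * k * c_in + c_in * c_out
    flops = h * w * params
    return params, flops


if __name__ == "__main__":
    # Hypothetical layer, chosen only for illustration.
    h, w, c_in, c_out, k = 80, 80, 64, 128, 3
    std_params, std_flops = standard_conv_cost(h, w, c_in, c_out, k)
    dw_params, dw_flops = dwconv_cost(h, w, c_in, c_out, k)
    print(f"standard: {std_params} params, {std_flops} FLOPs")
    print(f"DWConv:   {dw_params} params, {dw_flops} FLOPs")
    print(f"ratio:    {dw_flops / std_flops:.4f}")
    print(f"1/N+1/D2: {1 / c_out + 1 / k ** 2:.4f}")
```

For this hypothetical layer the separable design needs 8,768 parameters instead of 73,728, roughly 12% of the original cost.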

Visual comparison

Figure 5 illustrates the structural differences between standard convolution and depthwise separable convolution.

Fig. 5 — Visual comparison between standard convolution (left) and depthwise separable convolution (right). DWConv reduces computation by decoupling spatial and channel-wise operations.

Application in X-ray security inspection

Replacing standard convolutions with DWConv in the YOLOv8 backbone provides a favorable trade-off between detection accuracy and computational efficiency. This is crucial for practical X-ray inspection scenarios, where detection speed, memory efficiency, and real-time response are essential. By enabling deeper yet lightweight feature extractors, DWConv enhances performance in identifying small and occluded prohibited items with minimal resource demands.

Channel-Spatial Attention Fusion (CSAF) module

To enhance the detection of small, overlapping, and occluded objects in X-ray imagery, this paper proposes a Channel-Spatial Attention Fusion module (CSAF), as illustrated in Fig. 6. The proposed module effectively integrates the complementary advantages of the channel attention mechanism (SE) and the spatial attention mechanism (CBAM). Unlike conventional attention structures that apply channel and spatial attention in a cascaded or independent manner, CSAF employs a parallel and unified fusion strategy, enabling more comprehensive and discriminative feature recalibration.

Fig. 6 — Overview of the proposed CSAF attention module.

Principle analysis

The core motivation behind the CSAF module is to simultaneously exploit channel-wise and spatial-wise dependencies in feature maps, thereby enabling more precise and robust detection of small, overlapping, and occluded objects, a common challenge in X-ray security imagery.

Traditional channel attention mechanisms, such as Squeeze-and-Excitation (SE), excel at adaptively recalibrating channel responses by learning the importance of each feature channel. This allows the network to focus on semantically informative features but does not explicitly consider the spatial arrangement of objects, which is crucial for localization tasks. In contrast, spatial attention modules (as implemented in CBAM) generate attention maps that highlight salient spatial regions, helping the model localize objects within cluttered or complex backgrounds. However, the typical cascaded structure of SE and CBAM modules often leads to suboptimal feature interaction, as channel and spatial contexts are processed independently or in sequence, potentially missing the synergy between the two.

The proposed CSAF module addresses this limitation by employing a parallel dual-branch structure: the channel attention branch focuses on “what” is meaningful (semantic feature selection), while the spatial attention branch focuses on “where” the important features are located (spatial feature localization). By concatenating the outputs of these two branches and fusing them through a $1 \times 1$ convolution, CSAF enables the network to jointly recalibrate channel and spatial responses in a unified, learnable manner. This fusion allows for richer cross-dimensional interaction and adaptive information integration, strengthening the network’s ability to suppress background noise, amplify subtle cues from small or partially occluded targets, and maintain high discrimination in crowded scenes. Mathematically, this parallel fusion strategy enables the attention mechanism to better model higher-order feature relationships, going beyond simple sequential recalibration. As a result, the detector equipped with CSAF exhibits stronger robustness and generalization in complex conditions.

Channel attention (SE)

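Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, channel attention is generated by first applying global average pooling to aggregate spatial information for each channel:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)$$

The descriptor $z \in \mathbb{R}^{C}$ is passed through a two-layer bottleneck consisting of fully connected layers and non-linearities:

$$M_c = \sigma\big(W_2\, \delta(W_1 z)\big)$$

where $W_1$ and $W_2$ are the weights of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the sigmoid function. The resulting channel attention map $M_c \in \mathbb{R}^{C}$ is applied to recalibrate the original feature map:

$$F_c = M_c \otimes X$$

where $\otimes$ denotes element-wise multiplication.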

Spatial attention (CBAM)

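Spatial attention focuses on identifying salient regions within the feature map. Two spatial descriptors are constructed by applying average and max pooling to the input feature map $X$ across the channel dimension:

$$F_{\mathrm{avg}} = \mathrm{AvgPool}_{\mathrm{ch}}(X), \qquad F_{\mathrm{max}} = \mathrm{MaxPool}_{\mathrm{ch}}(X)$$

These descriptors are concatenated and processed by a convolutional layer, followed by a sigmoid activation $\sigma$ to produce the spatial attention map:

$$M_s = \sigma\big(\mathrm{Conv}([F_{\mathrm{avg}}; F_{\mathrm{max}}])\big)$$

which is then applied to the original features by element-wise multiplication:

$$F_s = M_s \otimes X$$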

Fusion strategy

Unlike traditional serial or additive combinations, CSAF adopts a feature-level fusion. The outputs of the channel and spatial attention branches, denoted $F_c$ and $F_s$, are concatenated and merged via a $1 \times 1$ convolution:

$$F_{\mathrm{out}} = \mathrm{Conv}_{1 \times 1}([F_c; F_s])$$

This operation enables adaptive integration of channel- and spatial-enhanced representations, yielding a final feature map $F_{\mathrm{out}}$ with enriched contextual information for subsequent detection heads.
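The parallel structure described above (an SE-style channel branch, a CBAM-style spatial branch, concatenation, and a $1 \times 1$ fusion convolution) can be sketched in NumPy as follows. This is an illustrative forward pass under stated assumptions, not the authors' NewAttention implementation: the reduction ratio `r`, the ReLU and sigmoid non-linearities, the 7 × 7 spatial kernel (the CBAM default), the omission of biases and normalization, and the random weights are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """SE-style branch. x: (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    z = x.mean(axis=(1, 2))                       # global average pooling -> (C,)
    m_c = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # FC -> ReLU -> FC -> sigmoid
    return x * m_c[:, None, None]                 # rescale every channel


def conv2d_same(x, kernel):
    """Naive zero-padded cross-correlation. x, kernel: (C_in, ., .) -> (H, W)."""
    k = kernel.shape[-1]                          # odd kernel size assumed
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out


def spatial_attention(x, kernel):
    """CBAM-style spatial branch. kernel: (2, k, k), applied to [avg; max]."""
    f_avg = x.mean(axis=0, keepdims=True)         # (1, H, W)
    f_max = x.max(axis=0, keepdims=True)          # (1, H, W)
    m_s = sigmoid(conv2d_same(np.concatenate([f_avg, f_max], axis=0), kernel))
    return x * m_s[None, :, :]                    # rescale every location


def csaf(x, w1, w2, spatial_kernel, w_fuse):
    """Parallel fusion: concatenate both branches, then a 1x1 conv. w_fuse: (C, 2C)."""
    f_c = channel_attention(x, w1, w2)
    f_s = spatial_attention(x, spatial_kernel)
    stacked = np.concatenate([f_c, f_s], axis=0)          # (2C, H, W)
    return np.einsum("oc,chw->ohw", w_fuse, stacked)      # 1x1 conv as a per-pixel linear map


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, h, w, r, k = 8, 6, 6, 4, 7                 # hypothetical sizes
    x = rng.standard_normal((c, h, w))
    out = csaf(
        x,
        rng.standard_normal((c // r, c)),
        rng.standard_normal((c, c // r)),
        rng.standard_normal((2, k, k)),
        rng.standard_normal((c, 2 * c)),
    )
    print(out.shape)  # (8, 6, 6)
```

Because both branches read the same input and only meet at the fusion step, neither attention map is conditioned on the other, which is the difference from the cascaded SE-then-CBAM arrangement discussed earlier.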

Algorithm: CSAF module (pseudocode)

Summary: The proposed CSAF module provides a unified, lightweight mechanism for simultaneously exploiting channel and spatial dependencies, thereby delivering more robust and discriminative features for X-ray object detection.

Hybrid loss function for detection head optimization

To achieve both accurate localization and robust feature discrimination, we design a hybrid loss function for the detection head in the proposed YOLOv8-DWConv-CSAF framework. This joint loss integrates the Pixel Intersection-over-Union (PIoU) loss for precise bounding box regression and the supervised InfoNCE contrastive loss to enhance feature embedding.

PIoU loss for oriented bounding box regression

The PIoU loss²⁵ is employed to optimize the regression of oriented bounding boxes, especially under conditions of severe occlusion and arbitrary object orientation. The loss is defined as:

$$\mathcal{L}_{\mathrm{PIoU}} = 1 - \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$

where $B_p$ and $B_g$ represent the sets of pixels within the predicted and ground-truth boxes, respectively. This formulation ensures sensitivity to both the position and the shape of objects, resulting in more precise localization.
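To make the pixel-set view of overlap concrete, the sketch below rasterizes two oriented boxes onto a pixel grid and computes the ratio of shared pixels to covered pixels. It is an illustration only: the box parameterization (center, width, height, angle), the helper names, and the use of one minus the pixel IoU as the loss value are assumptions made for the example. The hard pixel count used here is not differentiable, so a trainable loss would need a smooth approximation of the pixel membership test.

```python
import math


def box_pixels(cx, cy, w, h, theta, grid_w, grid_h):
    """Set of pixels whose centers lie inside an oriented box.

    (cx, cy) is the box center, (w, h) its side lengths, theta its rotation in radians.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pixels = set()
    for py in range(grid_h):
        for px in range(grid_w):
            dx, dy = px + 0.5 - cx, py + 0.5 - cy
            # Rotate the offset into the local frame of the box.
            u = dx * cos_t + dy * sin_t
            v = -dx * sin_t + dy * cos_t
            if abs(u) <= w / 2 and abs(v) <= h / 2:
                pixels.add((px, py))
    return pixels


def pixel_iou(pred, target):
    """Shared pixels divided by covered pixels (0.0 if both sets are empty)."""
    union = pred | target
    return len(pred & target) / len(union) if union else 0.0


def pixel_iou_loss(pred, target):
    """One minus the pixel IoU (assumed loss form for this sketch)."""
    return 1.0 - pixel_iou(pred, target)


if __name__ == "__main__":
    # Hypothetical boxes on a 20 x 20 grid; the second one is shifted and rotated.
    pred = box_pixels(10, 10, 8, 4, 0.0, 20, 20)
    target = box_pixels(11, 10, 8, 4, math.pi / 6, 20, 20)
    print(f"pixel IoU:  {pixel_iou(pred, target):.3f}")
    print(f"loss value: {pixel_iou_loss(pred, target):.3f}")
```

Unlike an axis-aligned IoU, the pixel count changes when only the angle changes, which is why a pixel-level formulation is sensitive to orientation.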

Supervised InfoNCE contrastive loss for feature embedding

To further improve the discriminability of learned features and enhance detection performance, we introduce a supervised InfoNCE contrastive loss²⁷ applied to the intermediate embeddings of the detection head. In object detection, where an image contains multiple objects and some anchors have no object associated with them, it is crucial to carefully select positive and negative samples.

Anchor matching and sample selection

In typical object detection frameworks like YOLO, anchors are pre-defined bounding boxes that match ground-truth objects based on their Intersection-over-Union (IoU) values. For each anchor, the matching criteria are:

Positive samples: Anchors that have a high IoU (typically IoU > 0.5) with ground-truth bounding boxes are labeled as positive samples. These anchors are responsible for detecting objects in the image.
Negative samples: Anchors that have a low IoU (IoU < 0.4) with all ground-truth boxes are considered negative samples. These anchors do not contain any objects and help distinguish between object and background regions.
Ignored samples: Anchors with an IoU between 0.4 and 0.5 can be ignored, as they are neither clearly positive nor negative.

For each anchor, we extract a feature embedding that represents the object or background in that anchor’s region. Contrastive loss is then applied to these feature embeddings.
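The matching rule above can be sketched as a small Python routine that labels each anchor by its best IoU against the ground-truth boxes. The thresholds 0.5 and 0.4 come from the criteria listed above; the axis-aligned `(x1, y1, x2, y2)` box format, the function names, and the example boxes are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor as 'positive', 'negative', or 'ignored'.

    Returns a list of (label, matched_gt_index or None), one entry per anchor.
    """
    labels = []
    for anchor in anchors:
        ious = [box_iou(anchor, gt) for gt in gt_boxes]
        best = max(range(len(ious)), key=ious.__getitem__, default=None)
        best_iou = ious[best] if best is not None else 0.0
        if best_iou > pos_thr:
            labels.append(("positive", best))
        elif best_iou < neg_thr:
            labels.append(("negative", None))
        else:
            labels.append(("ignored", None))
    return labels


if __name__ == "__main__":
    gt_boxes = [(0, 0, 10, 10)]
    anchors = [
        (0, 0, 10, 8),      # IoU 0.80 with the ground truth
        (0, 0, 10, 4.5),    # IoU 0.45, falls in the ignored band
        (20, 20, 30, 30),   # no overlap
    ]
    for anchor, label in zip(anchors, assign_anchors(anchors, gt_boxes)):
        print(anchor, label)
```

Only the anchors labeled positive or negative would contribute embeddings to the contrastive loss; ignored anchors are left out of both the positive and the negative sets.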

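Loss function for contrastive learning

For a mini-batch of $N$ samples, the InfoNCE contrastive loss is formulated as follows:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}{\exp\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big) + \sum_{k=1}^{K} \exp\big(\mathrm{sim}(z_i, z_k^{-}) / \tau\big)}$$

Here:

$z_i$ is the feature embedding of the anchor (positive or negative sample).
$z_i^{+}$ is the feature embedding of a positive sample, which is an anchor matched with a ground-truth object.
$z_k^{-}$ are the feature embeddings of $K$ negative samples, which are anchors with no object or those with a low IoU with the ground-truth boxes.
$\mathrm{sim}(\cdot, \cdot)$ is the similarity between two embeddings.
$\tau$ is a temperature parameter that controls the scale of similarity.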
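A minimal Python sketch of this loss for one anchor embedding, one positive, and $K$ negatives is given below. Cosine similarity and the temperature value 0.1 are assumptions made for the example, since the text does not fix either; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor with one positive and K negatives."""
    pos = cosine(anchor, positive) / tau
    logits = [pos] + [cosine(anchor, n) / tau for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - pos


def batch_info_nce(samples, tau=0.1):
    """Mean loss over a mini-batch of (anchor, positive, negatives) tuples."""
    return sum(info_nce(a, p, n, tau) for a, p, n in samples) / len(samples)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings.
    anchor = [1.0, 0.0]
    easy_negative = [[-1.0, 0.0]]
    hard_negative = [[0.9, 0.3]]
    positive = [0.95, 0.1]
    print("easy negative:", info_nce(anchor, positive, easy_negative))
    print("hard negative:", info_nce(anchor, positive, hard_negative))
```

A negative that lies close to the anchor contributes a large term to the denominator and therefore a larger loss, which is the motivation for prioritizing hard negatives during mining.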

Positive and negative sample considerations

Positive pairs: For each positive sample (anchor with a matching ground-truth object), we compute the feature similarity between the anchor and its corresponding object embedding. The InfoNCE loss encourages the anchor’s feature vector to be close to the object’s feature vector in the embedding space.
Negative pairs: For negative samples (anchors with no object), the loss drives the anchor’s feature vector away from the feature vectors of objects from different classes. This ensures that the network learns to distinguish between objects and background regions. Additionally, hard negative mining can be used to prioritize those negative anchors that are most similar to the positive samples, helping the model learn to better differentiate between subtle object-background differences.

Contrastive learning effects

The supervised InfoNCE loss encourages:

Intra-class compactness: The feature embeddings of objects within the same class are pulled closer together in the feature space.
Inter-class separability: The feature embeddings of objects from different classes are pushed further apart.

By optimizing this contrastive loss, the model learns more discriminative representations of both objects and background regions, improving its ability to detect objects under challenging conditions such as occlusion, clutter, or small object size.

To further improve the discriminability of learned features and enhance detection performance, we introduce a supervised InfoNCE contrastive loss²⁷ applied to the intermediate embeddings of the detection head. In YOLOv8, multiple feature maps are extracted at different scales, which are essential for detecting objects of various sizes. However, in the context of our proposed YOLOv8-DWConv-CSAF framework, the InfoNCE contrastive loss is applied to the feature embeddings obtained from the final fused feature map of the detection head, after multi-scale feature fusion.

The multi-scale feature maps are fused through the stacked C2f modules and the Channel-Spatial Attention Fusion (CSAF) module. This fusion enables more effective feature extraction and enhances the model’s ability to detect small and occluded objects. The embeddings obtained from this fused feature map are then utilized for the application of InfoNCE loss, which operates on these embeddings to improve intra-class compactness and inter-class separability.

By applying InfoNCE loss to these final embeddings, we encourage the model to learn discriminative features that can better separate objects from the background and improve classification performance, especially in complex detection scenarios like occlusions or small object detection.

Joint optimization objective

The overall training objective is formulated as a weighted sum of the detection and contrastive losses:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{det}} + \lambda\,\mathcal{L}_{\text{InfoNCE}} \qquad (16)$$

where $\mathcal{L}_{\text{det}}$ encompasses both the classification and PIoU regression losses, and $\lambda$ is a hyperparameter controlling the balance between localization and feature discrimination.

The weight $\lambda$ in Equation 16 controls the relative importance of the contrastive loss compared to the standard object detection loss. A higher value of $\lambda$ places more emphasis on the contrastive loss, encouraging the model to better separate positive and negative samples in the embedding space. This results in improved feature discriminability, which is beneficial for challenging tasks like detecting small or occluded objects. Conversely, a lower value of $\lambda$ reduces the influence of the contrastive loss, allowing the model to prioritize bounding box regression and classification accuracy.
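The effect of the weighting can be illustrated with a few lines of arithmetic. The sketch below assumes the objective is the detection loss plus the weighted contrastive loss; the loss values and the name `lam` are made up for illustration and are not measured results.

```python
def total_loss(det_loss, contrastive_loss, lam):
    """Weighted sum of the detection loss and the contrastive loss."""
    return det_loss + lam * contrastive_loss


det, con = 1.20, 0.50  # illustrative values only
# Fraction of the total objective contributed by the contrastive term.
shares = {lam: lam * con / total_loss(det, con, lam) for lam in (0.1, 0.5, 1.0)}
```

With these numbers the contrastive term accounts for a growing share of the objective as the weight increases, which is the trade-off the text describes.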

This hybrid loss strategy enables the network to jointly learn accurate object localization and robust, semantically meaningful feature embeddings. The contrastive loss improves the model’s ability to distinguish between foreground objects and background noise, which is crucial for enhancing detection performance and generalization in complex X-ray security screening scenarios, where occlusions and overlapping objects are common.

Experiments and evaluation

Experimental setup

All experiments were conducted using the PyTorch framework (v2.0) on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB VRAM), 45 GB of RAM, and a 14-core Intel Xeon Platinum 8362 processor. CUDA version 11.8 and cuDNN version 8.6 were used to accelerate computations under Ubuntu 20.04 LTS with Python 3.9.

The input resolution was fixed at Inline graphic pixels for all images. Unless otherwise stated, models were trained for 150 epochs using stochastic gradient descent (SGD) with a batch size of 32. Data loading was performed using 24 worker threads, and all training procedures were executed on a single GPU (GPU 0).

Dataset description

Experiments were conducted on the Cutters and Liquid Containers X-ray dataset (CLCXray)³³, a dedicated benchmark developed for evaluating object detection models in security inspection scenarios. The dataset consists of 9565 X-ray images, including 4543 real-world samples collected from subway security checkpoints and 5022 synthetically generated samples scanned from manually assembled baggage using X-ray imaging equipment. The dataset can be accessed and downloaded from the following link: https://github.com/GreysonPhoenix/CLCXray.

CLCXray encompasses 12 object categories, grouped into two major hazard classes: five types of cutters (i.e., blade, dagger, knife, scissors, swiss army knife) and seven types of liquid containers (i.e., cans, carton drinks, glass bottle, plastic bottle, vacuum cup, spray cans, and tin). All annotations adhere to the COCO format, with accurate bounding boxes provided for each object instance. The dataset is characterized by high intra-class variation, diverse object scales, frequent occlusions, and complex backgrounds–making it a challenging and representative benchmark for developing robust and generalizable detection models.

To ensure experimental consistency, the dataset was randomly split into training, validation, and test subsets, with an 8:2 ratio between training and validation. The test set was kept separate for evaluation and was not used during model training.
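A reproducible 8:2 split of this kind can be sketched as follows. The example assumes the held-out test images have already been set aside and that images are referred to by integer ids; the seed and the function name `split_train_val` are hypothetical.

```python
import random


def split_train_val(image_ids, train_fraction=0.8, seed=0):
    """Shuffle ids reproducibly and split them into train and validation lists."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # local RNG keeps the split repeatable
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


train_ids, val_ids = split_train_val(range(1000))
```

Fixing the seed means every compared model sees the same partition, which is what makes the ablation rows comparable.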

All images were resized to Inline graphic pixels prior to being fed into the network.

Figure 7 summarizes key statistical properties of the dataset:

Class distribution: Each class contains a sufficient number of labeled instances, with the blade category comprising over 3,000 samples and the plastic bottle category approximately 700.
Overlaid bounding boxes: Visualization of all bounding boxes indicates substantial variability in object shape, size, and spatial arrangement, with dense overlaps and irregular contours.
Object position distribution: A scatter plot of object center points reveals a uniform spatial distribution across the image plane, suggesting minimal positional bias.
Object size distribution: Analysis of bounding box dimensions confirms the presence of multi-scale objects, increasing the detection challenge for small and large items alike.

Fig. 7 — Overview of the CLCXray dataset. (a) Class distribution. (b) Overlaid bounding boxes. (c) Object position distribution. (d) Object size distribution.

Evaluation metrics

To comprehensively assess the effectiveness and deployability of the proposed detection framework, we adopt standard metrics encompassing accuracy, complexity, and inference speed.

Mean Average Precision (mAP)

Detection accuracy is evaluated using the mean Average Precision (mAP), a widely adopted metric in object detection tasks. For each class c, average precision ($AP_c$) is computed as the area under the precision–recall curve:

$$AP_c = \int_0^1 P_c(r)\,dr$$

where $P_c(r)$ represents the precision at recall level r.

The overall mAP is calculated as the mean over all C classes:

$$\text{mAP} = \frac{1}{C}\sum_{c=1}^{C} AP_c$$

Following the COCO evaluation protocol, we report two variants:

mAP@0.5: Average precision computed at an IoU threshold of 0.5.
mAP@0.5:0.95: Average precision computed across 10 IoU thresholds from 0.5 to 0.95 with 0.05 increments:

$$\text{mAP@0.5:0.95} = \frac{1}{10}\sum_{t \in \{0.50, 0.55, \dots, 0.95\}} \text{mAP@}t$$
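For readers who want to see the computation, the sketch below approximates AP as the area under a precision–recall curve and averages it over classes. It assumes all-point interpolation with a monotone precision envelope, a common choice in detection toolkits; the paper does not state its interpolation scheme, and the sample numbers are invented.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve using the monotone precision envelope.

    `recalls` must be sorted in increasing order and aligned with `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing as recall grows (right-to-left running max).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


ap_a = average_precision([0.5, 1.0], [1.0, 0.5])
ap_b = average_precision([1.0], [1.0])
map_value = mean_average_precision([ap_a, ap_b])
```

The mAP@0.5:0.95 variant repeats this computation at each of the ten IoU thresholds and averages the results.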

Model complexity

The architectural complexity of each model is evaluated using:

Number of parameters (Params): Total number of trainable parameters, reported in millions (M), i.e., the raw count divided by $10^6$.
Computational cost (GFLOPs): Floating-point operations per forward pass, reported in gigaFLOPs, i.e., the raw operation count divided by $10^9$.
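To make the parameter metric concrete, the sketch below counts the weights of a standard convolution and of a depthwise separable one (a depthwise k×k convolution followed by a pointwise 1×1 convolution). Bias terms and normalization layers are ignored, and the channel sizes are illustrative rather than taken from the model.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def dwsep_conv_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv plus a pointwise 1 x 1 conv (no bias)."""
    return k * k * c_in + c_in * c_out


standard = conv_params(64, 128, 3)         # 73728 weights
separable = dwsep_conv_params(64, 128, 3)  # 8768 weights
ratio = separable / standard               # equals 1/c_out + 1/k**2
params_in_millions = standard / 1e6
```

For a 3×3 layer the separable form needs roughly one ninth of the weights once the output channel count is large, which is why the substitution shrinks the model.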

Inference speed

The real-time capability of each model is measured using inference speed, reported as frames per second (FPS). This is computed by averaging the time over 500 forward passes on the test set using a single NVIDIA RTX 3090 GPU.
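A timing loop of this kind can be sketched as below. The warm-up count and the helper name `measure_fps` are assumptions, and the workload is a stand-in; a GPU measurement would call the detector instead and would also need to synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time


def measure_fps(forward, sample, n_runs=500, n_warmup=10):
    """Average frames per second of `forward(sample)` over `n_runs` calls."""
    for _ in range(n_warmup):  # warm-up calls are excluded from timing
        forward(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        forward(sample)
    elapsed = time.perf_counter() - start
    return n_runs / elapsed


# Stand-in workload; a real measurement would run the detector here.
fps = measure_fps(lambda x: sum(x), list(range(1000)))
```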

Together, these metrics provide a comprehensive assessment of detection accuracy, model efficiency, and practical deployability for X-ray security inspection tasks.

Results and analysis

To comprehensively assess the effectiveness, robustness, and practical deployment potential of the proposed detection framework, we conduct a series of systematic experiments, including ablation studies, confusion matrix analysis, and threshold-based performance evaluation. Specifically, the ablation study quantifies the individual contributions of each proposed module to the overall detection performance. The confusion matrix analysis offers detailed insights into class-wise classification behavior and misclassification patterns. Furthermore, the Precision–Confidence curve analysis examines prediction stability across varying confidence thresholds, reflecting the model’s reliability under different decision boundaries.

Ablation study

To quantify the contribution of each proposed module, we perform a stepwise ablation study on the CLCXray dataset. Starting from the baseline YOLOv8-n model, we incrementally introduce the following components: (1) Depthwise Separable Convolutions (DWConv), (2) the Channel-Spatial Attention Fusion (CSAF) module, (3) the Pixel-wise IoU (PIoU) regression loss, and (4) the supervised InfoNCE contrastive loss. The results are summarized in Table 1.

Table 1.

Ablation study of the proposed modules on the CLCXray dataset.

Method	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs
YOLOv8-n (Baseline)	88.4	75.6	3.01	8.1
+ DWConv	88.4	78.3	2.62	7.2
+ DWConv + CSAF	88.0	76.4	1.76	6.5
+ DWConv + CSAF + PIoU	88.3	76.5	1.76	6.5
+ DWConv + CSAF + PIoU + InfoNCE (ours)	88.8	79.1	1.76	6.5


The baseline model achieves 88.4% mAP@0.5 and 75.6% mAP@0.5:0.95. Introducing DWConv boosts fine-grained localization performance, increasing mAP@0.5:0.95 by 2.7 points to 78.3% with mAP@0.5 unchanged at 88.4%, while reducing model complexity to 2.62M parameters and 7.2 GFLOPs. This confirms DWConv’s benefit in enhancing accuracy under stricter IoU thresholds while improving efficiency.

Adding CSAF further compresses the model to 1.76M parameters and 6.5 GFLOPs at some cost in accuracy: mAP@0.5 falls from 88.4% to 88.0% and mAP@0.5:0.95 from 78.3% to 76.4%, which still exceeds the baseline’s 75.6%. This drop can be attributed to the primary role of CSAF in enhancing attention to spatial and channel dependencies, which does not directly target object localization precision but rather feature discriminability. This shift towards improved feature extraction may result in minor performance trade-offs in AP, especially in cases where localization accuracy is highly critical.

Replacing CIoU with PIoU improves localization quality slightly, achieving 76.5% mAP@0.5:0.95. This suggests that PIoU enhances bounding box regression, especially in cases of overlap and occlusion. PIoU better captures pixel-level IoU, which leads to more precise bounding box localization in dense or occluded regions.

Finally, integrating the supervised InfoNCE loss results in the highest overall performance–88.8% mAP@0.5 and 79.1% mAP@0.5:0.95–while maintaining the same model size and FLOPs. The InfoNCE loss further improves feature discriminability by encouraging the separation of positive and negative samples in the embedding space. This leads to a more robust and accurate model, particularly in handling small, occluded, or overlapping objects. The synergy between PIoU and InfoNCE thus boosts both localization precision and feature separation, culminating in the best performance.

In summary, each component contributes to either performance enhancement or model compression. Integrating the CSAF module with the PIoU and InfoNCE losses optimizes the model for both accurate detection and efficient feature representation. Their combination yields a compact yet highly accurate detection framework, confirming the proposed YOLOv8-DWConv-CSAF model’s effectiveness and practical applicability in real-world X-ray security inspection scenarios.

Confusion matrix analysis

To further examine the classification performance of our model, we visualize and compare the normalized confusion matrices of the baseline YOLOv8-n and the improved YOLOv8-DWConv-CSAF models on the CLCXray dataset, as shown in Figs. 8 and 9. Each matrix entry represents the percentage of instances from a ground-truth class (rows) predicted as a given class (columns), with diagonal elements indicating correct classifications.
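The row normalization described here can be sketched in a few lines, following the convention stated above that rows hold ground-truth classes. The counts are invented for a two-class toy case, and `normalize_rows` is a hypothetical helper name.

```python
def normalize_rows(counts):
    """Convert raw counts to per-row fractions (rows = ground-truth classes)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


counts = [[47, 3], [5, 45]]  # invented counts for two classes
matrix = normalize_rows(counts)
# Diagonal entries are the per-class accuracies.
per_class_accuracy = [matrix[i][i] for i in range(len(matrix))]
```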

As illustrated in Fig. 8, the baseline model exhibits notable confusion between visually similar classes such as scissors and knife, or blade and swiss army knife. Furthermore, it struggles with small or occluded items, resulting in elevated false positive rates.

By contrast, the proposed model in Fig. 9 demonstrates significantly improved class discrimination, as evidenced by stronger diagonal values and reduced off-diagonal noise. For example, classification accuracy increases from 94 to 97% for glass bottle, and from 13 to 16% for plastic bottle. In addition, false positives are reduced in categories such as spray cans, plastic bottle, and tin, highlighting the model’s improved robustness.

Table 2 presents a quantitative comparison of per-class accuracies. Nine of the twelve categories show stable or improved performance, with the largest gains (+0.03) occurring for glass bottle and plastic bottle, transparent or deformable objects that are typically challenging to detect in X-ray imagery. Blade and scissors each decline by 0.01, and tin declines by 0.04.

Table 2.

Per-class accuracy comparison between the baseline and improved YOLOv8n models on the CLCXray dataset.

Class	Baseline	Improved	Accuracy change
blade	0.98	0.97	−0.01
dagger	0.97	0.97	0.00
knife	0.98	0.98	0.00
scissors	1.00	0.99	−0.01
swiss army knife	0.96	0.96	0.00
cans	0.90	0.90	0.00
carton drinks	0.84	0.85	+0.01
glass bottle	0.94	0.97	+0.03
plastic bottle	0.13	0.16	+0.03
vacuum cup	0.96	0.97	+0.01
spray cans	0.86	0.86	0.00
tin	0.72	0.68	−0.04


Significant values are in bold.

The observed improvements can be attributed to three core components of our architecture: (1) the CSAF module, which enhances both spatial and semantic attention, especially for occluded or cluttered targets; (2) the supervised InfoNCE loss, which promotes discriminative feature learning and improves class separability; and (3) the PIoU regression loss, which refines localization quality through pixel-level alignment.

Overall, the confusion matrix analysis confirms that the proposed improvements not only yield higher detection metrics but also lead to more reliable and interpretable classification behavior across challenging object categories. Such robustness is critical in high-risk X-ray security scenarios, where accurate identification of ambiguous or densely packed threats is essential for operational safety.

Precision–confidence curve analysis

To evaluate the reliability of detection across varying confidence thresholds, we analyze the Precision–Confidence (P–C) curves of the baseline YOLOv8-n and the improved YOLOv8-DWConv-CSAF models, as shown in Figs. 10 and 11.

As illustrated in Fig. 10, the baseline model exhibits significant fluctuations in precision for several categories, especially for small or occluded objects such as dagger, tin, and plastic bottle. These instabilities at lower confidence levels suggest a higher false positive rate and less reliable classification outcomes.

In contrast, the proposed YOLOv8-DWConv-CSAF model (Fig. 11) yields notably smoother and more stable P–C curves across all object categories. It achieves perfect precision (1.00) at a higher confidence threshold of 0.972, compared to 0.957 for the baseline, indicating a stronger capacity to suppress false positives and enhanced classification confidence.
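A Precision–Confidence curve and the threshold at which precision first reaches 1.00 can be computed as sketched below. The detections are invented, the helper names are hypothetical, and treating an empty selection as precision 1.0 is a convention assumed here rather than something stated in the paper.

```python
def precision_at(threshold, detections):
    """Precision over detections whose confidence is at least `threshold`.

    `detections` is a list of (confidence, is_true_positive) pairs.
    """
    kept = [tp for conf, tp in detections if conf >= threshold]
    return sum(kept) / len(kept) if kept else 1.0  # assumed convention when empty


def first_perfect_threshold(detections, thresholds):
    """Smallest threshold in `thresholds` at which precision reaches 1.0."""
    for t in sorted(thresholds):
        if precision_at(t, detections) == 1.0:
            return t
    return None


dets = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, False)]
curve = [(t, precision_at(t, dets)) for t in (0.0, 0.5, 0.85)]
t_star = first_perfect_threshold(dets, [i / 100 for i in range(101)])
```

In this toy case the threshold is set by the most confident false positive: precision becomes perfect only once the cut-off rises above its score.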

This improvement is primarily attributed to the integration of Depthwise Separable Convolutions (DWConv), which enhance local feature extraction in shallow layers, and the CSAF attention mechanism, which improves spatial-channel focus. Together, they lead to more discriminative feature representations, benefiting the detection of small-scale or partially occluded targets frequently encountered in X-ray security imagery.

Recall–confidence curve analysis

To further assess detection robustness under varying confidence thresholds, we analyze the Recall–Confidence (R–C) curves of the baseline YOLOv8-n and the improved YOLOv8-DWConv-CSAF models, as illustrated in Figs. 12 and 13.

As shown in Fig. 12, the baseline model maintains high recall for large and visually distinctive objects such as knife, scissors, and blade. However, it suffers from severe drops in recall for smaller or visually ambiguous categories such as plastic bottle and tin, especially as the confidence threshold increases. The average recall curve (bold blue) exhibits a steep decline beyond a confidence level of 0.5, indicating limited robustness under stricter decision boundaries.

In comparison, the improved YOLOv8-DWConv-CSAF model (Fig. 13) demonstrates substantially more stable recall curves across most object categories. The average recall decreases more gradually with increasing confidence, retaining a value of 0.87–compared to 0.80 in the baseline model. Notably, improvements are observed in challenging categories such as carton drinks, spray cans, and glass bottle, where the recall curves remain smoother and consistently higher. Although the recall for plastic bottle remains relatively low–due to extreme occlusion and shape variation–the decline is less abrupt, reflecting improved resilience.
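The Recall–Confidence relationship can be reproduced on toy data as follows. The detections and the ground-truth count are invented, and `recall_at` is a hypothetical helper name.

```python
def recall_at(threshold, detections, n_ground_truth):
    """Recall when only detections with confidence >= threshold are kept.

    `detections` is a list of (confidence, is_true_positive) pairs and
    `n_ground_truth` is the number of annotated objects.
    """
    hits = sum(1 for conf, tp in detections if tp and conf >= threshold)
    return hits / n_ground_truth


dets = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, True)]
recall_curve = [recall_at(t, dets, n_ground_truth=5) for t in (0.0, 0.5, 0.9)]
```

Recall can only fall as the threshold rises, so a flatter curve means the true positives carry higher confidence scores.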

These enhancements can be primarily attributed to the combined effect of the proposed modules: DWConv enhances local feature extraction in early layers; the CSAF module facilitates refined attention to small-scale or overlapping objects; and the supervised InfoNCE loss contributes to better feature discrimination. Together, these components enable a more consistent and reliable recall response across confidence thresholds, which is particularly critical for real-world X-ray security inspection systems where detection sensitivity must be maintained without compromising specificity.

Generalization performance on HiXray dataset

To further assess the generalization capability of the proposed YOLOv8-DWConv-CSAF model, we conduct additional experiments on the HiXray dataset–a widely used public benchmark for X-ray security inspection. HiXray contains over 45,000 annotated X-ray images spanning 14 categories of prohibited items, and features challenging conditions such as heavy occlusion, clutter, and object overlap³⁴. The HiXray dataset can be accessed and downloaded from the following GitHub repository: https://github.com/hixray-author/hixray.

To simulate a domain adaptation scenario under limited supervision, we randomly sample 20% of the HiXray dataset to form a training subset, ensuring balanced category representation. Both the baseline YOLOv8-n and the proposed YOLOv8-DWConv-CSAF models are retrained on this subset using the same training pipeline as adopted for CLCXray. Performance is then evaluated on the official HiXray test set.
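One way to draw a category-balanced 20% subset is stratified sampling, sketched below. It simplifies by assuming each image has a single primary category (detection images can contain several), and the seed, category labels, and function name `stratified_sample` are hypothetical; the paper does not describe its exact sampling procedure.

```python
import random
from collections import defaultdict


def stratified_sample(items, fraction=0.2, seed=0):
    """Sample `fraction` of the image ids from each category.

    `items` is a list of (image_id, category) pairs.
    """
    by_category = defaultdict(list)
    for image_id, category in items:
        by_category[category].append(image_id)
    rng = random.Random(seed)
    subset = []
    for category in sorted(by_category):  # fixed order keeps the draw repeatable
        ids = by_category[category]
        subset.extend(rng.sample(ids, max(1, round(len(ids) * fraction))))
    return subset


items = [(i, "cat_a" if i % 2 else "cat_b") for i in range(100)]
subset = stratified_sample(items)
```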

Table 3 summarizes the comparative results.

Table 3.

Detection performance comparison on the HiXray dataset.

Model	mAP@0.5	mAP@0.5:0.95	Params (M)	GFLOPs
YOLOv8-n (Baseline)	75.9	45.1	3.00	8.1
YOLOv8-DWConv-CSAF (ours)	75.6	47.2	1.60	6.4


Although the proposed model yields a slightly lower mAP@0.5, it surpasses the baseline by 2.1 percentage points in mAP@0.5:0.95, indicating superior localization accuracy under stricter IoU thresholds. Furthermore, it maintains a highly compact architecture with nearly half the parameters and reduced computational complexity, highlighting its efficiency–accuracy trade-off in cross-domain detection tasks.
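The figures quoted here follow directly from Table 3; the arithmetic is written out below as a check, using only the tabulated values.

```latex
\begin{aligned}
\Delta\,\text{mAP@0.5:0.95} &= 47.2 - 45.1 = 2.1 \ \text{points} \\
\Delta\,\text{mAP@0.5} &= 75.6 - 75.9 = -0.3 \ \text{points} \\
\frac{\text{Params}_{\text{ours}}}{\text{Params}_{\text{baseline}}} &= \frac{1.60}{3.00} \approx 0.53 \quad (\text{about } 47\% \text{ fewer parameters}) \\
\frac{\text{GFLOPs}_{\text{ours}}}{\text{GFLOPs}_{\text{baseline}}} &= \frac{6.4}{8.1} \approx 0.79 \quad (\text{about } 21\% \text{ fewer operations})
\end{aligned}
```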

As shown in Fig. 14, the proposed model achieves robust classification across most classes, with reduced off-diagonal confusion. Particularly notable improvements are observed in visually similar or ambiguous categories, suggesting that the CSAF attention mechanism enhances discriminative feature encoding, while the supervised InfoNCE loss contributes to better inter-class separation.

In summary, the proposed YOLOv8-DWConv-CSAF framework demonstrates consistent performance across both CLCXray and HiXray datasets, validating its generalization capacity and adaptability. These qualities are essential for real-world deployment in diverse security inspection environments, where dataset shift and domain variability are common.

Comparison with existing methods

To further validate the effectiveness of the proposed YOLOv8-DWConv-CSAF framework, we compare its performance against several state-of-the-art object detection models on the CLCXray dataset. The benchmarked models include YOLOv4, YOLOv5s, YOLOv7, YOLOv8n, YOLOv8s, YOLOv9, YOLOv10n, RT-DETR, and YOLOv8n-GEMA. All comparative results, except those of our model, are sourced from the recent study by Wang et al.³⁵.

As summarized in Table 4, the proposed model achieves a competitive detection accuracy of 88.8% mAP@0.5, closely matching the best-performing models such as YOLOv9 (89.0%) and YOLOv8n-GEMA (88.9%). At 588.24 FPS, our model is faster than six of the nine compared detectors, although YOLOv8n (783.4 FPS), YOLOv9 (756.6 FPS), and YOLOv10n (802.0 FPS) remain faster. In particular, compared to YOLOv8n-GEMA (377.0 FPS), our approach delivers approximately 56% higher inference speed with comparable detection accuracy, demonstrating a favorable balance between performance and efficiency.
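The relative speed-up over YOLOv8n-GEMA can be checked from the FPS values in Table 4:

```latex
\frac{\text{FPS}_{\text{ours}}}{\text{FPS}_{\text{YOLOv8n-GEMA}}} = \frac{588.24}{377.0} \approx 1.56,
\qquad \text{i.e. roughly } 56\% \text{ higher throughput.}
```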

Table 4.

Comparison of mAP@0.5 and FPS on the CLCXray dataset.

Model	mAP@0.5 (%)	FPS
YOLOv4	72.6	305.8
YOLOv5s	88.3	348.0
YOLOv7	86.5	397.2
YOLOv8n	86.2	783.4
YOLOv8s	88.0	211.1
YOLOv9	89.0	756.6
YOLOv10n	88.0	802.0
RT-DETR	86.8	92.0
YOLOv8n-GEMA	88.9	377.0
Ours	88.8	588.24


All comparative results except “Ours” are adapted from³⁵

These results confirm that the proposed YOLOv8-DWConv-CSAF model offers an excellent trade-off between detection accuracy and inference efficiency. Its superior real-time performance, combined with competitive accuracy, makes it highly suitable for deployment in time-critical security inspection systems.

Discussion

The experimental results comprehensively demonstrate that the proposed YOLOv8-DWConv-CSAF framework achieves a compelling balance between detection accuracy, computational efficiency, and generalization capability–particularly in the context of complex X-ray security inspection tasks.

First, the ablation study reveals that each individual component–DWConv, CSAF, PIoU, and supervised InfoNCE–contributes meaningfully to the model’s overall performance. DWConv significantly reduces the parameter count and FLOPs without sacrificing detection accuracy, confirming its effectiveness as a lightweight feature extractor. The CSAF module enhances the model’s ability to focus on small, occluded, or overlapping objects, leading to improved precision under challenging visual conditions. The integration of PIoU and InfoNCE further refines bounding box localization and inter-class separability, especially under strict IoU thresholds, as evidenced by the increased mAP@0.5:0.95.

Second, confusion matrix and curve-based analyses demonstrate that our improvements not only yield numerical gains but also enhance model robustness across confidence thresholds. The smoother Precision–Confidence and Recall–Confidence curves indicate greater stability in classification outcomes, while the reduced inter-class confusion reflects improved discriminative capacity–crucial in scenarios where item ambiguity and occlusion are prevalent.

Third, generalization experiments on the HiXray dataset support the model’s adaptability to a new domain. When retrained on only 20% of HiXray, our model improves mAP@0.5:0.95 over the baseline (47.2% vs. 45.1%) with a marginally lower mAP@0.5 (75.6% vs. 75.9%), while preserving its lightweight design. This suggests that the proposed architecture is not overfitted to the source domain and can be reliably adapted to new data distributions–an essential characteristic for real-world deployment.

Finally, in comparison with existing state-of-the-art methods, YOLOv8-DWConv-CSAF achieves competitive or superior accuracy while running faster than six of the nine compared detectors. This is particularly important for real-time applications, where decision latency can directly impact operational safety and throughput.

In summary, the proposed model demonstrates clear advantages in terms of efficiency, accuracy, and generalization. Future work may explore extending this framework to multi-view or multi-modal X-ray scenarios, incorporating domain adaptation strategies, or applying the architecture to broader object detection domains beyond security inspection.

Conclusion

In this study, we propose YOLOv8-DWConv-CSAF, a lightweight and efficient object detection framework tailored for X-ray security inspection tasks. The architecture incorporates Depthwise Separable Convolutions (DWConv), a novel Channel-Spatial Attention Fusion (CSAF) module, PIoU regression loss, and a supervised InfoNCE contrastive loss. These components jointly enhance feature extraction, improve detection of small and occluded objects, and promote discriminative representation learning, all while maintaining a compact model design.

Extensive experiments conducted on the CLCXray dataset demonstrate that our model outperforms the baseline YOLOv8n in both detection accuracy and efficiency. Ablation studies validate the individual and combined benefits of the proposed modules, while confusion matrix and confidence-based curve analyses reveal improved robustness and class-level discrimination. Furthermore, the model achieves strong generalization performance on the HiXray dataset under a domain-adaptive setting, confirming its transferability to diverse real-world conditions.

Compared with existing state-of-the-art detectors, YOLOv8-DWConv-CSAF offers a favorable trade-off between accuracy and real-time inference speed, achieving 588.24 FPS while preserving high mAP. This makes it particularly suitable for practical deployment in high-throughput, safety-critical inspection systems.

In future work, we plan to explore integrating domain adaptation techniques, multi-modal fusion, and active learning to further improve generalization and adaptivity in varied security scenarios.

Author contributions

Qi Diao conceived the study, designed the methodology, conducted the experiments, and wrote the original draft. WengHowe Chan contributed to conceptualization, supervision, and critical revision of the manuscript. Azlan Mohd Zain provided methodological guidance and participated in manuscript review and editing. Kohbalan Moorthy contributed to data analysis and validation. YiFan Chen assisted with data curation and experimental implementation. Huan Tong and YuJie Huo contributed to data preprocessing and experimental support. All authors read and approved the final manuscript.

Funding

This research was funded by the Scientific Research Fund of the Zhejiang Provincial Education Department, China, under Grant Number Y202456182.

Data availability

The datasets analysed during the current study are publicly available and can be accessed for academic research purposes only. All datasets are appropriately cited in the manuscript. The CLCXray dataset is available at https://github.com/GreysonPhoenix/CLCXray. The dataset is restricted to non-commercial academic use. Copyright © Visual and Intelligent Learning Lab, Tongji University. All rights reserved. The HiXray dataset is available at https://github.com/HiXray-author/HiXray. The dataset is intended for academic research use only, and commercial use is prohibited. Copyright © State Key Laboratory of Software Development Environment, Beihang University (BUAA-SKLSDE), and the original image provider. All rights reserved. Redistribution without permission is not allowed.

Code availability

The code used to support the findings of this study is available from the corresponding author upon reasonable request. The code is intended for academic research purposes only.

Declarations

Competing interests

The authors declare no competing interests.

Ethics approval

This study does not involve human participants or animal experiments.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Qi Diao, Email: diaoqi@graduate.utm.my.

WengHowe Chan, Email: cwenghowe@utm.my.

References

1. Fang, Y., Xu, C. & Zhang, Y. Research on X-ray security contraband identification technology based on lightweight yolov8. Sci. Rep. 14(1), 25031 (2024).
2. Kumar, C.S., Vyshnavi, L., Charitha, T., Reddy, V.P.T., Reddy, V.N. & Reddy, B.C.S. Improving threat detection in airport security inspections with X-ray image enhancement. In: 2025 International Conference on Computational, Communication and Information Technology (ICCCIT) 325–330 (2025). 10.1109/ICCCIT62592.2025.10928163.
3. Guan, F., Zhang, H. & Wang, X. An improved yolov8 model for prohibited item detection with deformable convolution and dynamic head. J. Real Time Image Process. 22(2), 84 (2025).
4. Cha, Y.-J., Choi, W. & Büyüköztürk, O. Deep learning-based crack damage detection using convolutional neural networks. Comput. Aided Civil Infrastruct. Eng. 32(5), 361–378 (2017).
5. Hussain, M. Yolov1 to v8: Unveiling each variant-a comprehensive review of yolo. IEEE Access 12, 42816–42833 (2024).
6. Khanam, R. & Hussain, M. A review of yolov12: Attention-based enhancements vs. previous versions. arXiv preprint arXiv:2504.11995 (2025).
7. Chen, M., Zhang, Z., Jiang, N., Li, X. & Zhang, X. Yolo-srw: An enhanced yolo algorithm for detecting prohibited items in x-ray security images. IEEE Access 13, 68323–68339. 10.1109/ACCESS.2025.3560840 (2025).
8. User, R. sixray_yolov8 Dataset (v964). Roboflow Universe. Exported via Roboflow.com on January 19, 2024. The dataset contains 8263 X-ray images annotated with dangerous goods in YOLOv8 format. (2024). https://universe.roboflow.com/mpu-project/sixray_yolov8 Accessed 2025-07-04.
9. Zou, Z., Chen, K., Shi, Z., Guo, Y. & Ye, J. Object detection in 20 years: A survey. Proc. IEEE 111(3), 257–276. 10.1109/JPROC.2023.3238524 (2023).
10. Nayak, R., Pati, U. C., Das, S. K. & Sahoo, G. K. Yolo-gtwdnet: A lightweight yolov8 network with ghostnet backbone and transformer neck to detect handheld weapons for smart city applications. Signal Image Video Process. 18(11), 8159–8167 (2024).
11. Mittal, P. A comprehensive survey of deep learning-based lightweight object detection models for edge devices. Artif. Intell. Rev. 57(9), 242 (2024).
12. Huang, R., Pedoeem, J. & Chen, C. Yolo-lite: a real-time object detection algorithm optimized for non-gpu computers. In 2018 IEEE International Conference on Big Data (big Data) 2503–2510 (IEEE, 2018).
13. Fang, W., Wang, L. & Ren, P. Tinier-yolo: A real-time object detection method for constrained environments. IEEE Access 8, 1935–1944 (2019).
14. Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016).
15. Diwan, T., Anirudh, G. & Tembhurne, J. V. Object detection using yolo: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 82(6), 9243–9275 (2023).
16. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (2017).
17. Zhang, T., Xu, W., Luo, B. & Wang, G. Depth-wise convolutions in vision transformers for efficient training on small datasets. Neurocomputing 617, 128998. 10.1016/j.neucom.2024.128998 (2025).
18. Liu, Z., Abeyrathna, R. R. D., Sampurno, R. M., Nakaguchi, V. M. & Ahamed, T. Faster-yolo-ap: A lightweight apple detection algorithm based on improved yolov8 with a new efficient pdwconv in orchard. Comput. Electron. Agric. 223, 109118 (2024).
19. Qiu, Z. et al. Yolo-sdl: A lightweight wheat grain detection technology based on an improved yolov8n model. Front. Plant Sci. 15, 1495222 (2024).
20. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition 7132–7141 (2018).
21. Woo, S., Park, J., Lee, J.-Y. & Kweon, I.S. Cbam: Convolutional block attention module. In Proc. of the European Conference on Computer Vision (ECCV) 3–19 (2018).
22. Bae, M.-H., Park, S.-W., Park, J., Jung, S.-H. & Sim, C.-B. Yolo-race: Reassembly and convolutional block attention for enhanced dense object detection. Pattern Anal. Appl. 28(2), 90 (2025).
23. Li, Z., Xiao, L., Shen, M. & Tang, X. A lightweight yolov8-based model with squeeze-and-excitation version 2 for crack detection of pipelines. Appl. Soft Comput. 113260 (2025).
24. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z. & Lu, H. Dual attention network for scene segmentation. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
25. Chen, Z., Chen, K., Lin, W., See, J., Yu, H., Ke, Y. & Yang, C. Piou loss: Towards accurate oriented object detection in complex environments. In Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020 Proceedings, Part V 16, 195–211 (Springer, 2020).
26. Biswas, D. & Tešić, J. Domain adaptation with contrastive learning for object detection in satellite imagery. IEEE Trans. Geosci. Remote Sensing (2024).
27. Khosla, P. et al. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 33, 18661–18673 (2020).
28. Jiang, Y. et al. Ecc-polypdet: Enhanced centernet with contrastive learning for automatic polyp detection. IEEE J. Biomed. Health Inform. 28(8), 4785–4796 (2023).
29. Tu, X., He, Z., Fu, G., Liu, J., Zhong, M., Zhou, C., Lei, X., Yin, J., Huang, Y. & Wang, Y. Learn discriminative features for small object detection through multi-scale image degradation with contrastive learning. IEICE Transactions on Information and Systems (2024).
30. Zabin, M., Kabir, A. N. B., Kabir, M. K., Choi, H.-J. & Uddin, J. Contrastive self-supervised representation learning framework for metal surface defect detection. J. Big Data 10(1), 145 (2023).
31.Ultralytics: Ultralytics YOLOv8 Documentation. https://docs.ultralytics.com/zh/models/yolov8/. Accessed: 5 Apr 2025 (2023).
32.Sohan, M., Sai Ram, T. & Reddy, C.V.R. A review on yolov8 and its advancements. In International Conference on Data Intelligence and Cognitive Informatics 529–545 (Springer, 2024).
33.Zhao, C., Zhu, L., Dou, S., Deng, W. & Wang, L. Detecting overlapped objects in X-ray security imagery by a label-aware mechanism. IEEE Trans. Inf. Forensics Secur.17, 998–1009. 10.1109/TIFS.2022.3154287 (2022). [Google Scholar]
34.Tao, R., Wei, Y., Jiang, X., Li, H., Qin, H., Wang, J., Ma, Y., Zhang, L. & Liu*, X. Towards real-world x-ray security inspection: A high-quality benchmark and lateral inhibition module for prohibited items detection. In IEEE ICCV (2021).
35.Wang, A., Yuan, P., Wu, H., Iwahori, Y. & Liu, Y. Improved yolov8 for dangerous goods detection in X-ray security images. Electronics13(16), 3238 (2024). [Google Scholar]

Associated Data

Data Availability Statement

The code used to support the findings of this study is available from the corresponding author upon reasonable request. The code is intended for academic research purposes only.
