Abstract
In the domain of object detection, small object detection remains a pressing challenge, as existing approaches often suffer from limited accuracy, high model complexity, and difficulty meeting lightweight deployment requirements. In this paper, we propose PCPE-YOLO, a novel object detection algorithm specifically designed to address these difficulties. First, we put forward a dynamically reconfigurable C2f_PIG module. This module uses a parameter-aware mechanism to adapt its bottleneck structures to different network depths and widths, reducing parameters while maintaining performance. Next, we introduce a Context Anchor Attention mechanism that boosts the model's focus on the contexts of small objects, thereby improving detection accuracy. In addition, we add a small object detection layer to enhance the model's localization capability for small objects. Finally, we integrate an Efficient Up-Convolution Block to sharpen decoder feature maps, enhancing small object recall with minimal computational overhead. Experiments on the VisDrone2019, KITTI, and NWPU VHR-10 datasets show that PCPE-YOLO outperforms both the baseline and other state-of-the-art methods in precision, recall, and mean average precision while using fewer parameters, achieving the highest precision on each dataset in the cross-dataset comparison. On VisDrone2019 in particular, it achieves improvements of 3.8% in precision, 5.6% in recall, 6.2% in mAP50, and 5% in F1 score over the baseline, effectively combining lightweight design with high small object detection performance and providing a more efficient and reliable solution for small object detection in real-world applications.
Keywords: YOLOv8, Object detection, Small object, Lightweight
Subject terms: Mathematics and computing, Computer science
Introduction
Object detection is one of the core tasks in computer vision, with applications in autonomous driving, security surveillance, medical imaging, remote sensing, and many other fields. With the rapid advancement of deep learning—especially following the introduction of algorithms such as YOLO1 and SSD2—the field of object detection has seen remarkable gains in both accuracy and real-time performance. However, while current object detection techniques perform well on medium and large objects, detecting small objects remains a significant challenge. Small objects occupy only a tiny portion of the image, resulting in sparse feature information. In addition, their low contrast with the background makes them more susceptible to interference from lighting variations and environmental noise, often leading to a noticeable decline in detection accuracy. Therefore, small object detection must not only overcome issues related to low resolution and limited features but also address the challenges of complex environmental interference and occlusion. These factors together make small object detection a particularly difficult task, urgently requiring the development of specialized detection algorithms to improve both accuracy and robustness. Meanwhile, the limited computational resources of edge devices impose strict constraints on model size and computational complexity. Under such conditions, existing methods often struggle to balance real-time performance and detection precision for small objects. This issue is especially critical in applications such as autonomous driving, remote sensing image analysis, and medical image diagnosis, where small objects often carry crucial information, and detection precision directly affects the safety, reliability, and effectiveness of the entire system. As a result, designing an improved model with both efficient perceptual capability and strong resistance to background noise—capable of meeting the specific demands of small object detection—remains a pressing research focus.
In recent years, deep learning algorithms have developed rapidly, offering significant improvements over traditional methods in various aspects. Modern object detection approaches based on deep learning can be broadly categorized into two types: two-stage and one-stage methods. Two-stage methods, such as Fast R-CNN3, Faster R-CNN4, and Mask R-CNN5, typically consist of two stages: generating region proposals and then performing classification and localization on these proposals. These approaches often rely on convolutional neural networks to extract features from specific regions, which are then analyzed for object classification and precise localization. In contrast, one-stage methods eliminate the region proposal stage by directly performing classification and bounding box regression on the input image. This category includes the Single Shot Multibox Detector (SSD)2, the You Only Look Once (YOLO)6 series, and the DEtection TRansformer (DETR)7. Since the introduction of YOLO in 2016, the YOLO series has attracted considerable attention. The early versions, YOLOv1 to YOLOv4, focused on improving detection efficiency1,8–10, while YOLOv5 became widely adopted due to its lightweight design and open-source availability11. Subsequent versions, from YOLOv6 to YOLOv8, further improved model performance12–14. With its balance of real-time speed and high accuracy, YOLOv8 has become one of the mainstream choices in object detection. More recent updates, from YOLOv9 to YOLOv12, have continued to improve in terms of overall performance15–18. The YOLO series is known for its fast inference speed and high detection precision. Its core idea is to treat object detection as a regression problem. The input image is divided into a grid, with each grid cell responsible for detecting one or more objects. This design enables the detection of overlapping and multiple objects while reducing spatial complexity. DETR, on the other hand, is an end-to-end object detection framework that reformulates the detection task as a set prediction problem. It discards traditional components such as region proposal networks and non-maximum suppression. Real-Time DEtection TRansformer (RT-DETR)19, an optimized version of DETR, introduces a hybrid encoder design, an IoU-aware query selection mechanism, and a dynamically adjustable decoder depth. These enhancements significantly improve inference speed and stability. To further address the challenges of small object detection, many studies in recent years have proposed enhancements in feature extraction, loss functions, and multi-scale learning. A commonly used strategy is to improve feature fusion across multiple scales using structures such as the Feature Pyramid Network (FPN)20 and Path Aggregation Network (PANet)21. FPN adopts a top-down pathway with lateral connections to fetch information from different layers, improving small object detection performance. PANet strengthens the propagation of low-level features to enhance localization accuracy for small objects. In addition, improvements to localization loss functions have also played an important role in small object detection. The Complete Intersection over Union (CIoU)22 loss was introduced to refine bounding-box regression for small objects, yielding more precise localization. Although these methods have somewhat improved small object detection, they still exhibit shortcomings. To boost precision, many approaches increase model complexity and computational cost, making them difficult to deploy on resource-constrained devices.
To achieve a balance between small object detection precision and model lightweighting for practical applications across various domains, research on YOLOv8-based improvements has made notable progress. In terms of lightweight design, many studies have focused on optimizing the backbone architecture. For instance, Wu et al.23 reduced model complexity by improving the ShuffleNetv2 module and integrating it into the YOLOv8 framework. Similarly, Pan et al.24 proposed novel modules such as ESEMB and Faster, which maintained detection efficiency while significantly reducing the number of parameters and computational cost. To address the inherent challenges of small object detection, Lei et al.25 and Wang et al.26 applied multi-scale feature fusion techniques to YOLOv8, enabling hierarchical integration of feature maps at different resolutions. This greatly enhanced the representation capacity for small objects. Additionally, Tahir et al.27 and Bao et al.28 incorporated attention mechanisms into the feature extraction process, allowing the model to adaptively focus on salient regions of small objects. These strategies have been shown to improve object discrimination in complex backgrounds. Im Choi et al.29 proposed a pruning method based on the attention mechanism, which reduces model parameters and computational complexity while maintaining model performance. In terms of application-specific extensions, research has increasingly shifted toward meeting domain-specific requirements. For example, in unmanned aerial vehicle (UAV) scenarios, GCL-YOLO30 achieved significant improvements in small object detection accuracy through its lightweight design and efficient feature fusion structure. In the context of autonomous driving, Liu et al. proposed the YOLOv8-FDD31 model, which enhanced real-time performance by optimizing the detection head and feature extraction modules. Shen et al.32 proposed a lightweight deep convolutional neural network and hybrid attention-based algorithm for instrument indication acquisition, significantly improving the accuracy and timeliness of instrument recognition in complex scenarios. Than et al.33 proposed an improved model based on Deformable DETR, which significantly enhances the detection performance of small and occluded objects. Furthermore, in the context of remote sensing image detection, Nguyen et al.34 proposed a hybrid convolution–Transformer-based model that significantly improves the recognition performance for small and complex objects. Likewise, in maritime object detection tasks, Ha et al.35 introduced the YOLO-SR model, which enhances both the accuracy and robustness of small ship detection under challenging SAR imaging conditions. Shen et al.36 constructed a large-scale dataset of ancient Chinese mural elements and introduced an adaptive random erasing augmentation algorithm, effectively addressing sampling and labeling challenges and enhancing the precision and speed of mural element detection. Shen et al.37 presented a finger vein recognition algorithm based on a lightweight deep convolutional neural network and triplet loss, which significantly improves recognition accuracy and matching efficiency while effectively handling background noise and new category recognition.
Despite these advancements, several key challenges remain. First, detection robustness in dynamic scenes and under low-light conditions still needs to be significantly improved. Second, a universal solution for balancing model lightweighting and detection precision has yet to be established—particularly for resource-constrained environments such as mobile devices38. These open issues point to crucial future research directions, requiring coordinated efforts in both theoretical innovation and engineering practice.
To address the above challenges, this paper proposes a novel C2f structure and improves the YOLOv8 model, aiming to enhance detection efficiency through lightweight design and improved small object detection capabilities. The main contributions of this paper are as follows:
We propose a dynamic, reconfigurable C2f_PIG module that reduces parameters while preserving detection performance.
We introduce a Context Anchor Attention (CAA) mechanism to enhance the model’s ability to capture contextual information around small objects.
We add a small object detection layer to optimize the localization process for small objects and boost overall model performance.
We replace the model’s upsampling module with the Efficient Up-Convolution Block (EUCB), thereby optimizing the upsampling process and boosting the recall rate for small object detection.
Methods
Based on YOLOv8m, this paper proposes a lightweight model friendly to small object detection, named PCPE-YOLO. All C2f modules in the original model are replaced with C2f_PIG modules, which improve computational efficiency and reduce the number of parameters while maintaining the model's performance. At the end of the backbone network, a lightweight CAA module is added to enhance the model's contextual modeling capability, thereby achieving precise localization and recognition of small objects. To reduce missed detections of small objects, a small object detection layer is added to the Head, effectively addressing the limitations of the YOLOv8 model in detecting extremely small objects. Finally, all upsampling modules in YOLOv8 are replaced with EUCB modules, allowing better resolution recovery in the neck of the model and thus improving subsequent feature extraction and detection. The structure of PCPE-YOLO is shown in Fig. 1a.
Fig. 1.
Structure of PCPE-YOLO, CAA and EUCB.
C2f_PIG module
C2f_PIG
Current research on improving the C2f module in the YOLOv8 model often suffers from limited generalization capability. Most proposed improvements are tailored to specific model variants and cannot be effectively transferred across architectures with varying depths and widths. Specifically, modules that perform well in lightweight models tend to suffer from parameter redundancy when applied to models with a large number of parameters. Conversely, lightweight modules designed for large-scale models often experience significant performance degradation when deployed in compact models. In other words, existing improvements to the C2f module lack generalization across different network depths. To address the coupling problem between model scale and improvement effectiveness, this study innovatively proposes a dynamically reconfigurable C2f module termed C2f_PIG that enables adaptive bottleneck configuration via a parameter-aware mechanism. The module adopts a dual-path heterogeneous bottleneck design: in low-parameter C2f_PIG modules, bottlenecks focused on performance are automatically activated, while in high-parameter C2f_PIG modules, some of these performance bottlenecks are replaced with lightweight alternatives.
In the YOLOv8 structure, the parameter n represents the number of Bottleneck repetitions in the C2f module. This parameter is used to control the number of bottlenecks and thereby control the overall parameter count of the model. Based on this design paradigm, this paper establishes a dual-level dynamic bottleneck switching strategy: when n is below a certain threshold, the C2f module uses Bottleneck structures with relatively more parameters to ensure model performance. This Bottleneck structure is referred to as the high-parameter Bottleneck, which still has fewer parameters than the original Bottleneck. When n exceeds the threshold, part of the high-parameter Bottlenecks in the C2f module are replaced with Bottleneck structures that have fewer parameters to ensure model compactness. These are referred to as low-parameter Bottlenecks. To maintain a reasonable balance in the module, the threshold for n is set to 3—when n is less than or equal to 3, the high-parameter Bottleneck is used, focusing on performance enhancement and retention; when n is greater than 3, the low-parameter Bottleneck is used, focusing on the overall lightweight design of the model.
This paper improves the Bottleneck structure by incorporating Partial convolution (PConv) and Inception depthwise convolution (IDConv) to construct the high-parameter Bottleneck, while adopting GhostBottleneckV2 as the low-parameter Bottleneck. The structures of C2f_PIG, Bottleneck_PI, and GhostBottleneckV2 are shown in Fig. 2.
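A minimal PyTorch sketch of the parameter-aware switching rule is given below. The class names C2f_PIG, Bottleneck_PI, and GhostBottleneckV2 follow Fig. 2, but the two bottlenecks are reduced to lightweight stand-ins here, and because the exact fraction of bottlenecks replaced when n exceeds the threshold is not stated in the text, the half-and-half split is an illustrative assumption rather than the project's implementation.

```python
import torch
import torch.nn as nn


class Bottleneck_PI(nn.Module):
    """Lightweight stand-in for the high-parameter bottleneck (PConv + IDConv)."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(c, c, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x):
        return x + self.block(x)


class GhostBottleneckV2(nn.Module):
    """Lightweight stand-in for the low-parameter bottleneck."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(c, c, 1, bias=False),
                                   nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x):
        return x + self.block(x)


class C2f_PIG(nn.Module):
    """Parameter-aware C2f: the repetition count n decides which bottlenecks are used."""
    def __init__(self, c1, c2, n=1, threshold=3, e=0.5):
        super().__init__()
        self.c = int(c2 * e)                                 # hidden channels per branch
        self.cv1 = nn.Conv2d(c1, 2 * self.c, 1)
        self.cv2 = nn.Conv2d((2 + n) * self.c, c2, 1)
        blocks = []
        for i in range(n):
            # n <= threshold: only high-parameter bottlenecks (performance-oriented).
            # n >  threshold: replace part of them (here the second half, an assumed
            #                 ratio) with low-parameter GhostBottleneckV2 blocks.
            use_high = n <= threshold or i < n // 2
            blocks.append(Bottleneck_PI(self.c) if use_high else GhostBottleneckV2(self.c))
        self.m = nn.ModuleList(blocks)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))                # split into two paths
        y.extend(m(y[-1]) for m in self.m)                   # cascade the bottlenecks
        return self.cv2(torch.cat(y, dim=1))                 # fuse all intermediate outputs
```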
Fig. 2.
C2f_PIG structure.
Partial convolution
Partial convolution (PConv) is a lightweight convolution operation specifically designed to accelerate neural networks by reducing FLOPs to optimize network speed39. However, merely decreasing FLOPs does not always significantly reduce latency caused by frequent memory access in standard convolution operations. To address this, PConv applies filters to only a small portion of channels while leaving the rest unprocessed. Compared to traditional convolutions, this approach reduces redundant computations and memory access, thereby effectively improving computational efficiency.
$\mathrm{FLOPs}_{\mathrm{Conv}} = h \times w \times k^{2} \times c^{2}$ (1)

$\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^{2} \times c_{p}^{2}$ (2)

$\mathrm{MAC}_{\mathrm{Conv}} = h \times w \times 2c + k^{2} \times c^{2} \approx h \times w \times 2c$ (3)

$\mathrm{MAC}_{\mathrm{PConv}} = h \times w \times 2c_{p} + k^{2} \times c_{p}^{2} \approx h \times w \times 2c_{p}$ (4)
Equations (1) and (2) represent the computational cost of standard convolution and PConv, respectively, while Eqs. (3) and (4) describe their memory access requirements. Here, $h$ and $w$ denote the height and width of the input feature map, $c$ is the number of input channels, $c_p$ is the number of partial input channels, and $k$ is the kernel size. Under the default setting $c_p/c = 1/4$, the FLOPs of PConv are only 1/16 of those of standard convolution, and its memory access is reduced to 1/4 of that of standard convolution. Therefore, incorporating PConv into the model can effectively reduce both computational cost and memory access, accelerating inference and enabling more efficient use of computational resources on edge devices. Moreover, PConv's design avoids the excessive channel compression seen in traditional convolutions, which often leads to loss of fine-grained details. This minimizes the degradation of small-object features, allowing subsequent modules to extract features more effectively.
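The following is a small sketch of PConv as described above, assuming the split-and-concatenate form with the default ratio $c_p = c/4$; the reference implementation used in the project repository may differ in details.

```python
import torch
import torch.nn as nn


class PConv(nn.Module):
    """Partial convolution: convolve only the first c_p channels, pass the rest through."""
    def __init__(self, channels, ratio=4, kernel_size=3):
        super().__init__()
        self.cp = channels // ratio                            # partial channels, c_p = c / 4 by default
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.cp, x.size(1) - self.cp], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)           # untouched channels concatenated back


# Shape check: the spatial size and channel count are preserved.
print(PConv(64)(torch.randn(1, 64, 32, 32)).shape)             # torch.Size([1, 64, 32, 32])
```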
Inception depthwise convolution
To meet the needs of lightweight design and small object detection, we improve the Bottleneck structure using Inception depthwise convolution (IDConv)40. The main idea behind IDConv is to process features through different branches and then concatenate the resulting feature maps. First, a small depthwise convolution kernel is used as one branch. Then, a large $k_h \times k_w$ depthwise convolution kernel is decomposed into two smaller depthwise convolution kernels, $k_h \times 1$ and $1 \times k_w$, which form two additional branches. Finally, the original input is retained as an identity mapping branch. The outputs from these four branches are concatenated to form the final output. This design avoids the use of large $k_h \times k_w$ depthwise convolution kernels, thus addressing the issue of slow execution caused by large square kernels in practice. Specifically, the input $X$ is divided into four groups along the channel dimension, as shown in Eq. (5).
$X_{hw},\ X_{w},\ X_{h},\ X_{id} = \mathrm{Split}(X) = X_{:,\,:g},\ X_{:,\,g:2g},\ X_{:,\,2g:3g},\ X_{:,\,3g:}$ (5)

Here, $g$ represents the number of channels allocated to the convolution branches. By setting a ratio $r_g$, the final number of branch channels is determined using $g = r_g C$. The split input is then fed into different parallel branches, as shown in Eqs. (6)–(9).

$X'_{hw} = \mathrm{DWConv}_{k_s \times k_s}(X_{hw})$ (6)

$X'_{w} = \mathrm{DWConv}_{1 \times k_b}(X_{w})$ (7)

$X'_{h} = \mathrm{DWConv}_{k_b \times 1}(X_{h})$ (8)

$X'_{id} = X_{id}$ (9)

Among them, $k_s$ denotes the kernel size of the small depthwise convolution, with a default value of 3, and $k_b$ represents the size of the band-shaped convolution, with a default value of 11. Finally, the outputs of all branches are concatenated, as expressed in Eq. (10).

$X' = \mathrm{Concat}(X'_{hw},\ X'_{w},\ X'_{h},\ X'_{id})$ (10)
IDConv offers significant advantages in both lightweight design and small object detection. In terms of lightweighting, IDConv decomposes large convolutional kernels into multiple parallel branches with small kernels and an identity mapping, substantially reducing parameter count, computational complexity, and memory access costs. For small object detection, the parallel branches of IDConv can simultaneously capture local detail features and long-range contextual information, while the identity mapping branch preserves the original features of some channels, helping to avoid the loss of small object information caused by downsampling.
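A compact sketch of IDConv is shown below, using the defaults stated above ($k_s = 3$, $k_b = 11$); the branch ratio $r_g = 1/8$ follows the InceptionNeXt defaults and is an assumption here.

```python
import torch
import torch.nn as nn


class InceptionDWConv2d(nn.Module):
    """Four parallel branches over a channel split, following Eqs. (5)-(10)."""
    def __init__(self, channels, square_kernel=3, band_kernel=11, branch_ratio=0.125):
        super().__init__()
        g = int(channels * branch_ratio)                       # g = r_g * C channels per conv branch
        self.dw_hw = nn.Conv2d(g, g, square_kernel, padding=square_kernel // 2, groups=g)
        self.dw_w = nn.Conv2d(g, g, (1, band_kernel), padding=(0, band_kernel // 2), groups=g)
        self.dw_h = nn.Conv2d(g, g, (band_kernel, 1), padding=(band_kernel // 2, 0), groups=g)
        self.split_sizes = (g, g, g, channels - 3 * g)         # last chunk is the identity branch

    def forward(self, x):
        x_hw, x_w, x_h, x_id = torch.split(x, self.split_sizes, dim=1)   # Eq. (5)
        return torch.cat((self.dw_hw(x_hw),                    # Eq. (6): small square kernel
                          self.dw_w(x_w),                      # Eq. (7): 1 x k_b band kernel
                          self.dw_h(x_h),                      # Eq. (8): k_b x 1 band kernel
                          x_id), dim=1)                        # Eqs. (9)-(10): identity + concat
```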
GhostBottleneckV2
The GhostBottleneckV2 structure primarily consists of the Ghost module and the DFC attention mechanism, which together ensure a reduction in parameter count while enhancing the model’s ability to capture pixel-wise dependencies over long spatial distances41. The structure of GhostBottleneckV2 is shown in Fig. 3a.
Fig. 3.
GhostBottleneckV2 structure.
The Ghost module decouples the feature generation mechanism of traditional convolutions and proposes a staged optimization strategy to efficiently leverage the redundancy in feature maps. As illustrated in Fig. 3b, the module extracts features in two cooperative stages: the first stage employs a dimensionality-reducing convolution to generate a small number of intrinsic feature maps that capture core semantic information; the second stage applies multi-path linear transformations to each intrinsic feature, generating a set of spatially correlated Ghost feature maps. These are finally fused with the intrinsic features via channel concatenation. This approach maintains feature expressiveness while significantly reducing computational complexity, making it a key enabler for lightweight models.
The DFC attention mechanism, implemented based on fully connected layers, innovatively decouples and reconstructs traditional fully connected layers along the horizontal and vertical axes, enabling long-range spatial dependency modeling while remaining hardware-friendly. Its structure is shown in Fig. 3c. Specifically, this mechanism uses a dual-path parallel structure: the horizontal fully connected layer focuses on modeling horizontal long-range feature dependencies, while the vertical fully connected layer captures vertical spatial relationships. The resulting features are then fused to form an attention map with global perception capability. This decoupling strategy avoids the high computational cost of traditional self-attention mechanisms and eliminates hardware-unfriendly tensor reshaping operations through a convolutional implementation. It significantly improves computational efficiency and demonstrates refined local-global feature coordination in densely packed small-object regions. In the GhostBottleneckV2 structure, the DFC attention mechanism complements the existing Ghost module through a feature enhancement path. By performing decoupled fully connected operations in the downsampled feature space, the module achieves collaborative optimization of local detail and global semantics at a minimal computational cost.
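For illustration, a minimal sketch of the two-stage Ghost feature generation is given below; the DFC attention branch and the full GhostBottleneckV2 wiring are omitted for brevity, and the ratio of intrinsic to ghost channels is an assumption.

```python
import torch
import torch.nn as nn


class GhostModule(nn.Module):
    """Stage 1: a few intrinsic maps; stage 2: cheap depthwise 'ghost' maps; then concatenation."""
    def __init__(self, c_in, c_out, ratio=2, dw_size=3):
        super().__init__()
        init_ch = c_out // ratio                               # intrinsic feature maps
        ghost_ch = c_out - init_ch                             # cheaply generated ghost maps
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(                            # depthwise "linear" transformation
            nn.Conv2d(init_ch, ghost_ch, dw_size, padding=dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)            # fuse intrinsic and ghost features
```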
CAA module
In remote sensing or natural images, small object detection often suffers from weak semantic representation and background interference. Traditional methods based on large-kernel convolutions or dilated convolutions can enlarge the receptive field but tend to introduce excessive parameters or produce sparse feature representations. To maintain accuracy in small object detection while achieving lightweight design, this study integrates the Context Anchor Attention (CAA)42 module, whose structure is shown in Fig. 1b. This mechanism is designed to capture long-range contextual dependencies through a lightweight design while maintaining computational efficiency, thereby enabling accurate localization and recognition of small objects.
The CAA module consists of three stages: input feature processing, one-dimensional stripe convolution, and adaptive attention refinement.
In the input feature processing stage, the CAA first applies global average pooling $P_{avg}$ to the input feature map $X \in \mathbb{R}^{C \times H \times W}$ to aggregate spatial information, followed by a $1 \times 1$ convolution to project the channels. This step reduces redundant spatial details while preserving critical regional features essential for small object detection. The process is formulated in Eq. (11):

$F = \mathrm{Conv}_{1 \times 1}\left(P_{avg}(X)\right)$ (11)
In the one-dimensional stripe convolution stage, to avoid the computational cost of traditional large-kernel convolutions, the CAA decomposes the context modeling process into two depthwise separable one-dimensional stripe convolutions, as shown in Eqs. (12) and (13):

$F_{w} = \mathrm{DWConv}_{1 \times k_b}(F)$ (12)

$F_{h} = \mathrm{DWConv}_{k_b \times 1}(F_{w})$ (13)
The convolution kernel size $k_b$ is dynamically adjusted according to the network stage index $n$, as defined in Eq. (14):

$k_b = 11 + 2n$ (14)
These one-dimensional stripe convolutions, applied separately in horizontal and vertical directions, can explicitly model long-range dependencies. Compared to standard convolutions, they significantly reduce the number of parameters while maintaining an equivalent receptive field. Furthermore, gradually increasing the kernel size with network depth ensures that deeper layers prioritize broader contextual relationships, thereby enhancing the global modeling capability of high-level features.
In the adaptive attention refinement stage, the propagated features are passed through a convolution and a Sigmoid activation function to generate the attention map, as shown in Eq. (15):

$A = \mathrm{Sigmoid}\left(\mathrm{Conv}_{1 \times 1}(F_{h})\right)$ (15)
Then, using the attention map, the original features are enhanced via element-wise multiplication to produce the output $Y$ of the CAA module, as shown in Eq. (16):

$Y = A \otimes X$ (16)

where $\otimes$ denotes element-wise multiplication.
This attention mechanism selectively amplifies the discriminative features of small objects while suppressing irrelevant background regions, thereby highlighting key regional features for small object detection, reducing background noise, and preserving local details.
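A minimal sketch of the CAA computation in Eqs. (11)–(16) is given below. The 7 × 7 average-pooling window follows the reference CAA implementation and is an assumption here, as is the use of 1 × 1 convolutions for both channel projections.

```python
import torch
import torch.nn as nn


class CAA(nn.Module):
    """Context Anchor Attention following Eqs. (11)-(16)."""
    def __init__(self, channels, stage_idx=1):
        super().__init__()
        kb = 11 + 2 * stage_idx                                # Eq. (14): kernel grows with depth
        self.pool = nn.AvgPool2d(7, stride=1, padding=3)       # average pooling (window size assumed)
        self.proj_in = nn.Conv2d(channels, channels, 1)        # Eq. (11): channel projection
        self.dw_w = nn.Conv2d(channels, channels, (1, kb), padding=(0, kb // 2), groups=channels)
        self.dw_h = nn.Conv2d(channels, channels, (kb, 1), padding=(kb // 2, 0), groups=channels)
        self.proj_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        f = self.proj_in(self.pool(x))                         # Eq. (11)
        f = self.dw_h(self.dw_w(f))                            # Eqs. (12)-(13): 1-D stripe convolutions
        attn = torch.sigmoid(self.proj_out(f))                 # Eq. (15): attention map
        return attn * x                                        # Eq. (16): element-wise enhancement
```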
Small object detection layer
To address the limitations of the original YOLOv8 model in detecting extremely small objects, this study introduces an additional detection layer specifically for small objects. The original model employs three detection layers (P5, P4, and P3), where the P5 layer uses a 20 × 20 feature map to detect objects larger than 32 × 32 pixels, the P4 layer uses a 40 × 40 feature map for objects larger than 16 × 16 pixels, and the P3 layer uses an 80 × 80 feature map for objects larger than 8 × 8 pixels. However, in real-world scenarios, distant or small-sized objects often occupy less than 8 × 8 pixels in the image, making it difficult for the original model to detect them effectively and limiting its small object detection performance. To further enhance the model's capability in detecting extremely small regions, this study introduces a P2 detection layer operating on a 160 × 160 feature map, specifically designed to detect objects larger than 4 × 4 pixels, thereby improving the recall rate for small objects, as illustrated in Fig. 1e. This feature layer is obtained after only two downsampling operations, preserving higher resolution and richer small-object feature information. As a result, this detection layer can fully leverage these features to enhance the model's perception of small objects, significantly improving detection performance and reducing missed detections. To avoid increasing computational costs, this work does not employ an even higher-resolution detection layer for small object detection. This design maintains a balance between enhancing perception capability for small objects and preserving computational efficiency. The structure of this layer is identical to that of the other detection heads.
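The relationship between detection levels, feature-map sizes, and the roughly smallest detectable object size for a 640 × 640 input can be summarized with the short sketch below; it simply restates the figures given above.

```python
# For a 640 x 640 input, each detection level corresponds to a fixed stride; adding the
# P2 level (stride 4) extends coverage down to objects of roughly 4 x 4 pixels.
input_size = 640
strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

for level, stride in strides.items():
    fmap = input_size // stride                    # feature-map side length
    print(f"{level}: {fmap}x{fmap} feature map, objects from about {stride}x{stride} px")
```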
EUCB module
To balance feature resolution recovery and computational efficiency in small object detection, this study employs the Efficient Up-Convolution Block (EUCB)43. This module is designed to achieve high-precision multi-scale feature reconstruction with lightweight computation. Traditional upsampling methods often introduce redundant computations when recovering fine details—an issue that is particularly problematic for small objects due to their low pixel occupancy. EUCB significantly improves edge preservation and classification confidence for small objects while reducing computational cost by utilizing depthwise separable convolutions and a staged feature enhancement strategy.
As shown in Fig. 1c, EUCB consists of three core operations: bilinear upsampling, depthwise convolution optimization, and channel dimension adjustment.
In the bilinear upsampling stage, EUCB enlarges the input feature map $X \in \mathbb{R}^{C \times H \times W}$ to $2H \times 2W$ via bilinear interpolation to obtain a feature map $X_{up}$ with initially recovered spatial resolution, as shown in Eq. (17):

$X_{up} = \mathrm{Up}_{2\times}^{\,bilinear}(X)$ (17)
In the depthwise convolution optimization stage, EUCB applies a depthwise convolution to the upsampled feature map $X_{up}$ to extract local spatial features, followed by normalization and ReLU activation to generate the enhanced feature $X_{dw}$, as shown in Eq. (18):

$X_{dw} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{DWConv}_{k \times k}(X_{up})\right)\right)$ (18)
Here, the depthwise convolution processes features channel-by-channel, reducing the number of parameters from $k^{2} \times C_{in} \times C_{out}$ in standard convolution to $k^{2} \times C_{in}$, thereby significantly lowering computational cost.
In the channel dimension adjustment stage, EUCB uses a $1 \times 1$ convolution to map the number of channels to the target dimensionality $C_{out}$, as shown in Eq. (19):

$Y = \mathrm{Conv}_{1 \times 1}(X_{dw})$ (19)
This operation flexibly adapts to the feature fusion needs of different decoding stages without adding spatial computational burden.
The combination of these three steps is represented in Eq. (20):

$\mathrm{EUCB}(X) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{DWConv}_{k \times k}\left(\mathrm{Up}_{2\times}^{\,bilinear}(X)\right)\right)\right)\right)$ (20)
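A minimal sketch of EUCB following Eqs. (17)–(20) is shown below; the 3 × 3 depthwise kernel is an assumption, since the text only specifies a depthwise convolution.

```python
import torch
import torch.nn as nn


class EUCB(nn.Module):
    """Efficient Up-Convolution Block following Eqs. (17)-(20)."""
    def __init__(self, c_in, c_out, kernel_size=3):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)   # Eq. (17)
        self.dwconv = nn.Sequential(                                                   # Eq. (18)
            nn.Conv2d(c_in, c_in, kernel_size, padding=kernel_size // 2,
                      groups=c_in, bias=False),
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True))
        self.pwconv = nn.Conv2d(c_in, c_out, 1)                                        # Eq. (19)

    def forward(self, x):
        return self.pwconv(self.dwconv(self.up(x)))                                    # Eq. (20)
```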
Experiments
Datasets
This paper utilizes the VisDrone2019 dataset44, the KITTI dataset45, and the NWPU VHR-10 dataset46 for model training.
The VisDrone2019 dataset is a large-scale UAV vision benchmark dataset collected and released by the AISKYEYE team from the Machine Learning and Data Mining Laboratory at Tianjin University, China. The annotated objects in the dataset are divided into 12 categories: ignored regions, pedestrian, person, bicycle, car, van, truck, tricycle, awning tricycle, bus, motor, and others. The dataset for object detection contains 6,471 training images, 548 validation images, and 3,190 test images. Because they carry little useful information, the "ignored regions" and "others" categories are excluded in this paper. The validation and test images are mixed and then split evenly to avoid having too few validation samples, which could adversely affect model training. Ultimately, the dataset is divided into training, validation, and test sets in a ratio of 8:1:1.
The KITTI dataset was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Research Institute. It is currently one of the largest and most widely used benchmark datasets for evaluating computer vision algorithms in autonomous driving scenarios. The KITTI dataset includes nine object categories: Truck, Van, Tram, Car, Person_sitting, Pedestrian, Cyclist, DontCare, and Misc. In this study, a total of 7,481 images were selected. The categories Truck, Van, and Tram were merged into the Car category, and Person_sitting was merged into Pedestrian. The Cyclist category was retained, while the DontCare and Misc categories were ignored. This merging strategy is based on the fact that Truck, Van, Tram, and Car all belong to the category of motor vehicles, while Person_sitting and Pedestrian are both human-related classes. These merged categories share high similarity in terms of visual appearance, motion patterns, and application scenarios. In autonomous driving tasks, the model’s ability to detect objects is often more critical than distinguishing between visually similar classes. This processing method is thus better aligned with the objectives of object detection tasks and helps mitigate the negative effects caused by ambiguous annotation boundaries in the original dataset. Ignoring the DontCare and Misc categories allows the model to focus more effectively on primary targets in autonomous driving scenarios. The dataset was split into training, validation, and testing sets in a ratio of 8:1:1.
The NWPU VHR-10 dataset is a high-resolution remote sensing object detection dataset released in 2014 by Northwestern Polytechnical University in China. It consists of 800 images, including 715 images sourced from Google Earth and 85 images from the Vaihingen dataset. Among them, 650 are positive samples, and 150 are negative samples. The dataset is annotated with 10 geospatial object categories, including airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. In this paper, the dataset was also divided into training, validation, and testing sets in a ratio of 8:1:1.
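The exact split script is not published with the text; the following is a generic sketch of an 8:1:1 random split, where the directory path, file extension, and random seed are placeholders.

```python
import random
from pathlib import Path


def split_dataset(image_dir, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign image files to train/val/test lists in an 8:1:1 ratio."""
    files = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    return {"train": files[:n_train],
            "val": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:]}


splits = split_dataset("datasets/NWPU-VHR-10/images")   # placeholder path
print({name: len(items) for name, items in splits.items()})
```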
Experiment details
The experiments in this study are conducted on a hardware environment consisting of an NVIDIA Tesla V100 SXM3-32GB GPU and a 10 vCPU Intel Xeon Processor, running on Ubuntu 22.04. The software environment includes Python 3.12, PyTorch 2.5.1, and CUDA 12.4.
Since this study involves training five different scales of the YOLOv8 model—YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x—there are slight differences in training parameter configurations. These configurations are summarized in Table 1. To ensure experimental consistency, models of the same scale were trained using identical parameter settings. All other configurations followed the default settings of the original YOLOv8 framework.
Table 1.
Experimental configurations.
| Model | Image size | Batch size | Epochs | Workers | Optimiser | Learning rate | Weight decay | Momentum |
|---|---|---|---|---|---|---|---|---|
| YOLOv8n | 640 | 32 | 150 | 8 | SGD | 0.01 | 0.0005 | 0.9 |
| YOLOv8s | 640 | 32 | 150 | 8 | SGD | 0.01 | 0.0005 | 0.9 |
| YOLOv8m | 640 | 32 | 200 | 8 | SGD | 0.01 | 0.0005 | 0.9 |
| YOLOv8l | 640 | 16 | 200 | 8 | SGD | 0.01 | 0.0005 | 0.9 |
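For reference, the Table 1 settings for a YOLOv8m-scale run can be expressed with the standard Ultralytics training interface as sketched below; the dataset YAML and model configuration names are placeholders rather than files taken from this work.

```python
from ultralytics import YOLO

# Table 1 settings for a YOLOv8m-scale run; "yolov8m.yaml" and "VisDrone.yaml" are
# placeholder configuration names, not files shipped with this paper.
model = YOLO("yolov8m.yaml")
model.train(
    data="VisDrone.yaml",
    imgsz=640,
    batch=32,
    epochs=200,
    workers=8,
    optimizer="SGD",
    lr0=0.01,
    weight_decay=0.0005,
    momentum=0.9,
)
```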
To evaluate the performance of the model proposed in this study, this paper adopts Precision (P), Recall (R), mean Average Precision (mAP), number of Parameters, and the F1-score as evaluation metrics. Among them, Precision, Recall, mAP, and F1-score are used to assess the model’s detection performance, while the number of parameters is used to evaluate the model’s size in terms of parameter count. The corresponding formulas are as follows:
$P = \frac{TP}{TP + FP}$ (21)
In Eq. (21), TP denotes the number of correctly detected objects, and FP denotes the number of falsely detected objects. Precision (P) represents the proportion of correctly detected objects among all predicted objects.
$R = \frac{TP}{TP + FN}$ (22)
In Eq. (22), FN represents the number of missed detections. Recall (R) indicates the proportion of correctly detected objects out of the total number of ground truth objects.
$AP = \int_{0}^{1} P(R)\,dR$ (23)

$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i}$ (24)
In Eq. (23), AP refers to the Average Precision for each class, calculated based on the Precision-Recall (PR) curve, where the vertical axis is Precision (P) and the horizontal axis is Recall (R). The area under the PR curve corresponds to the AP. The mean Average Precision (mAP) is obtained by averaging the APs across all classes, serving as a comprehensive indicator of both precision and recall. mAP50 is the mAP value calculated with an IoU threshold of 0.5, as defined in Eq. (24), where n is the total number of classes with labeled images.
$F1 = \frac{2 \times P \times R}{P + R}$ (25)
In Eq. (25), the F1-score is calculated as the harmonic mean of Precision and Recall.
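A short sketch of how these count-based metrics are computed from raw detection counts is given below; the counts used in the example are illustrative only, not values from the experiments in this paper.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from raw counts (Eqs. 21, 22 and 25)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative counts only.
p, r, f1 = detection_metrics(tp=80, fp=20, fn=40)
print(f"P={p:.3f}  R={r:.3f}  F1={f1:.3f}")   # P=0.800  R=0.667  F1=0.727
```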
Comparative experiment
C2f_PIG module comparison experiment on different YOLOv8 model
This subsection presents experimental validation of the feasibility of the proposed dynamically reconfigurable C2f_PIG module. The results are shown in Table 2, where C2f_PI refers to the C2f module improved using the high-parameter Bottleneck structure, and C2f_G refers to the one improved using the low-parameter Bottleneck structure. In this experiment, all C2f modules in the model were replaced with the respective experimental modules. Overall, replacing all C2f modules with C2f_G significantly reduces the number of parameters, but it also leads to a proportional decline in performance. In contrast, C2f_PI substantially reduces the parameter count without degrading detection accuracy, and even shows performance improvements in the YOLOv8n and YOLOv8m models. Furthermore, C2f_PIG achieves an additional reduction in parameters compared to C2f_PI, with only minimal changes in performance, demonstrating the feasibility of the proposed structure.
Table 2.
Comparison with different scale model.
| Model | P (%) | R (%) | mAP50(%) | Params(M) | F1(%) |
|---|---|---|---|---|---|
| YOLOv8n | 43.2 | 33.0 | 32.4 | 3.01 | 37.4 |
| YOLOv8n + C2f_PI | 43.7 | 32.8 | 32.5 | 2.63 | 37.5 |
| YOLOv8n + C2f_G | 41.8 | 30.5 | 30.3 | 2.13 | 35.3 |
| YOLOv8n + C2f_PIG | 43.7 | 32.8 | 32.5 | 2.63 | 37.5 |
| YOLOv8s | 48.9 | 37.8 | 37.9 | 11.1 | 42.6 |
| YOLOv8s + C2f_PI | 49.8 | 37.2 | 37.9 | 9.58 | 42.6 |
| YOLOv8s + C2f_G | 48.0 | 36.0 | 36.8 | 7.58 | 41.1 |
| YOLOv8s + C2f_PIG | 49.8 | 37.2 | 37.9 | 9.58 | 42.6 |
| YOLOv8m | 51.4 | 40.4 | 40.9 | 25.9 | 45.2 |
| YOLOv8m + C2f_PI | 52.8 | 39.8 | 41.0 | 20.8 | 45.4 |
| YOLOv8m + C2f_G | 51.0 | 38.8 | 39.8 | 14.2 | 44.1 |
| YOLOv8m + C2f_PIG | 52.5 | 39.9 | 40.9 | 19.0 | 45.3 |
| YOLOv8l | 55.3 | 41.2 | 43.2 | 43.6 | 47.2 |
| YOLOv8l + C2f_PI | 53.9 | 42.2 | 43.1 | 33.7 | 47.3 |
| YOLOv8l + C2f_G | 52.6 | 41.1 | 42.2 | 20.9 | 46.1 |
| YOLOv8l + C2f_PIG | 53.3 | 42.4 | 43.1 | 29.0 | 47.2 |
C2f_PIG module comparison experiment
To demonstrate the superiority of the proposed module, this subsection improves the Bottleneck structure of the C2f module using WTConv47, PConv48, SAConv49, and ODConv50, resulting in the modified modules named C2f_WTConv, C2f_PConv, C2f_SAConv, and C2f_ODConv, respectively. Additionally, this study applies the lightweight neck module g_ghostbottleneck51 to improve the C2f module, naming the resulting structure C2f_g_ghostbottleneck. These modified C2f modules are then used to replace the original C2f modules in YOLOv8m for comparison. The experimental results are presented in Table 3.
Table 3.
Comparison with other modules.
| Model | P (%) | R (%) | mAP50(%) | Params(M) | F1(%) |
|---|---|---|---|---|---|
| YOLOv8m | 51.4 | 40.4 | 40.9 | 25.9 | 45.2 |
| +WTConv | 50.2 | 39.8 | 40.4 | 20.0 | 44.4 |
| +PConv | 52.0 | 40.0 | 40.9 | 23.5 | 45.2 |
| +SAConv | 53.0 | 39.5 | 40.9 | 20.9 | 45.3 |
| +ODConv | 50.2 | 40.2 | 40.6 | 44.6 | 44.6 |
| +g_ghostbottleneck | 51.0 | 39.0 | 39.9 | 14.9 | 44.2 |
| Ours | 52.5 | 39.9 | 40.9 | 19.0 | 45.3 |
C2f_PConv and C2f_SAConv achieve performance levels comparable to those of the original C2f and C2f_PIG, but when considering the parameter count, C2f_PIG clearly demonstrates superior overall performance. The C2f_ODConv model introduces more than twice the number of parameters compared to the C2f_PIG model. However, as its multi-dimensional attention mechanism fails to handle small object features more effectively than the proposed module, C2f_ODConv exhibits inferior performance compared to C2f_PIG. On the other hand, the C2f_g_ghostbottleneck module, which has the fewest parameters, fails to extract features effectively due to its overly limited parameter capacity, resulting in a significant drop in performance compared to other modules. In summary, the proposed C2f_PIG module strikes a well-balanced trade-off between performance and lightweight design, thereby validating its superiority.
Comparison experiment
This subsection conducts comparative experiments between the proposed model and both mainstream models and those from other studies. The experimental results are presented in Table 4. Among them, GS-YOLOv8l and GCL-YOLO are lightweight models specifically optimized for small object detection. Except for a slightly lower precision compared to YOLOv10l, the proposed model outperforms all others across the remaining four metrics. This is mainly attributed to PCPE-YOLO's higher recall, which leads to the detection of more objects, slightly impacting precision. Because the VisDrone2019 dataset contains a large number of densely packed and occluded objects, Faster R-CNN, whose region proposal network generates a limited number of candidate regions, suffers from significantly reduced recall, resulting in lower overall performance. GS-YOLOv8l, GCL-YOLOv5l, and GCL-YOLOv8l also show considerable performance gaps compared to PCPE-YOLO. Furthermore, the more recent versions YOLO11l and YOLO12l fall short of PCPE-YOLO across all evaluation metrics. In conclusion, the experimental results validate the superiority of the proposed PCPE-YOLO model.
Table 4.
Result of comparative experiment.
| Model | P (%) | R (%) | mAP50(%) | Params(M) | F1(%) |
|---|---|---|---|---|---|
| YOLOv8m14 | 51.4 | 40.4 | 40.9 | 25.9 | 45.2 |
| Faster R-CNN4 | 43.3 | 13.6 | 13.5 | 41.4 | 20.7 |
| YOLOv10l16 | 55.7 | 41.7 | 43.8 | 25.8 | 47.7 |
| YOLO11l17 | 53.7 | 42.1 | 43.7 | 25.3 | 47.2 |
| RT-DETR-l19 | 43.2 | 26.4 | 24.2 | 32.8 | 32.8 |
| YOLO12l18 | 54.9 | 42.7 | 43.8 | 26.4 | 48.0 |
| GS-YOLOv8l52 | 54.0 | 40.8 | 42.2 | 31.8 | 46.5 |
| GCL-YOLOv5l30 | 52.1 | 40.7 | 41.5 | 38.5 | 45.7 |
| GCL-YOLOv8l | 52.2 | 41.1 | 41.9 | 25.4 | 46.0 |
| Ours | 55.2 | 46.0 | 47.1 | 19.5 | 50.2 |
The black bold numbers in the table indicate the best results.
Results of PCPE-YOLO comparison experiment on different datasets
The proposed model demonstrates outstanding performance across different datasets, validating its generalization capability. The experimental results are shown in Table 5. On the VisDrone2019 dataset, PCPE-YOLO outperforms the baseline model in all evaluation metrics, achieving a 6.2% improvement in mAP50. Compared with YOLO11m, which has a similar number of parameters, the proposed model improves precision by 1%, recall by 5.1%, mAP50 by 4.4%, and F1 score by 3.6%, confirming its advantages in small object detection and lightweight design. On the remote sensing dataset NWPU VHR-10, although PCPE-YOLO shows a 1% lower recall than YOLO11m, it surpasses YOLO11m in precision, mAP50, and F1 score, indicating better overall performance. In the NWPU VHR-10 dataset, the wide variation in object sizes and the high similarity between some objects and the background lead to a slight decrease in recall for PCPE-YOLO, despite its added small object detection layer. On the autonomous driving dataset KITTI, PCPE-YOLO continues to outperform other models across all metrics. Notably, RT-DETR-l performs poorly on all three datasets. This is primarily because RT-DETR-l, based on the Transformer architecture, requires large-scale datasets to achieve optimal performance. However, the three datasets used in this study are relatively small in scale, thus preventing RT-DETR-l from reaching its full potential. Moreover, both GCL-YOLOv5l and GCL-YOLOv8l exhibited substantial performance gaps compared to PCPE-YOLO across all datasets. Furthermore, the magnitude of improvement across various metrics achieved by PCPE-YOLO varies significantly among the three datasets. The most substantial gains are observed on the VisDrone2019 dataset, followed by the KITTI dataset, with the NWPU VHR-10 dataset exhibiting the least improvement. This discrepancy primarily stems from the fact that PCPE-YOLO is specifically designed for small object detection. The VisDrone2019 dataset contains a high density of small targets, and the KITTI dataset also includes a notable proportion of such objects. Consequently, PCPE-YOLO demonstrates considerable performance enhancements over the baseline model on these two datasets. In contrast, the NWPU VHR-10 dataset primarily contains large objects, such as aircraft and airfields. This dataset primarily tests the model's capability for object detection against complex backgrounds, explaining why the improvements delivered by PCPE-YOLO are less pronounced here.
Table 5.
Comparison on different datasets.
| Dataset | Model | P (%) | R (%) | mAP50(%) | Params(M) | F1(%) |
|---|---|---|---|---|---|---|
| VisDrone2019 | YOLOv8m | 51.4 | 40.4 | 40.9 | 25.9 | 45.2 |
| | RT-DETR-l | 43.2 | 26.4 | 24.2 | 32.8 | 32.8 |
| | YOLO11m | 54.2 | 40.9 | 42.7 | 20.0 | 46.6 |
| | GCL-YOLOv5l | 52.1 | 40.7 | 41.5 | 38.5 | 45.7 |
| | GCL-YOLOv8l | 52.2 | 41.1 | 41.9 | 25.4 | 46.0 |
| | Ours | 55.2 | 46.0 | 47.1 | 19.5 | 50.2 |
| NWPU VHR-10 | YOLOv8m | 89.7 | 84.9 | 91.0 | 25.9 | 87.2 |
| | RT-DETR-l | 86.9 | 79.0 | 85.9 | 32.8 | 82.8 |
| | YOLO11m | 90.0 | 85.1 | 91.4 | 20.0 | 87.5 |
| | GCL-YOLOv5l | 84.3 | 80.3 | 85.1 | 38.5 | 82.3 |
| | GCL-YOLOv8l | 89.1 | 84.7 | 89.6 | 25.4 | 86.8 |
| | Ours | 91.9 | 84.1 | 91.7 | 19.5 | 87.8 |
| KITTI | YOLOv8m | 92.4 | 84.6 | 90.8 | 25.9 | 88.3 |
| | RT-DETR-l | 81.7 | 80.2 | 86.8 | 32.8 | 80.9 |
| | YOLO11m | 92.6 | 84.6 | 91.1 | 20.0 | 88.4 |
| | GCL-YOLOv5l | 93.4 | 83.9 | 90.3 | 38.5 | 88.4 |
| | GCL-YOLOv8l | 92.7 | 84.5 | 90.9 | 25.4 | 88.4 |
| | Ours | 94.8 | 86.7 | 93.5 | 19.5 | 90.6 |
The black bold numbers in the table indicate the best results.
Ablation experiment
This subsection conducts ablation experiments on the VisDrone2019 dataset to validate the effectiveness of each proposed module. The experimental results are shown in Table 6. In Table 6, Model 1 represents the original YOLOv8m; Model 2 is the model with an added small object detection layer; Model 3 introduces the CAA module; Model 4 replaces all C2f modules with C2f_PIG modules; and Model 5 incorporates the EUCB upsampling module.
Table 6.
Result of ablation experiment.
| Model | P2 | CAA | C2f_PIG | EUCB | P (%) | R (%) | mAP50(%) | Params(M) | F1(%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | | 51.4 | 40.4 | 40.9 | 25.9 | 45.2 |
| 2 | √ | | | | 55.2 | 44.2 | 45.9 | 25.1 | 49.1 |
| 3 | √ | √ | | | 56.2 | 44.3 | 46.3 | 26.0 | 49.5 |
| 4 | √ | √ | √ | | 56.4 | 44.7 | 46.5 | 19.0 | 49.9 |
| 5 | √ | √ | √ | √ | 55.2 | 46.0 | 47.1 | 19.5 | 50.2 |
The results show that adding the P2 small object detection layer makes the model better suited for small object detection tasks, yielding improved performance. After introducing the CAA module, the model parameters increase slightly, but the model becomes more focused on small object regions, and all performance metrics improve. Replacing all C2f modules with C2f_PIG substantially reduces the parameter count while slightly improving every detection metric, as the C2f_PIG modules include enhanced Bottleneck structures whose band-shaped large-kernel convolutions enlarge the receptive field. When used together with the CAA module, this results in comprehensive improvements in detection performance. Finally, integrating the EUCB module further improves the recall by 1.3%. Although the precision decreases slightly due to the increase in detected targets, this has minimal negative impact, and both mAP50 and F1 score still increase by 0.6% and 0.3%, respectively. The ablation study results in Table 6 demonstrate the effectiveness of the proposed approach.
To further validate the real-time potential of the modules, this paper measured Frames Per Second (FPS) and latency based on the ablation experiment. The results are shown in Table 7. The results demonstrate that incorporating each module led to a decrease in FPS. This reduction is primarily attributable to the increased depth of the neural network layers after module integration, which consequently slows down inference speed. This sacrifice in FPS was made in exchange for enhanced performance. More specifically, the incorporation of the small object detection layer exerts the most significant impact on model real-time performance. In contrast, the addition of the CAA module induces no change in model inference time, demonstrating its potential for real-time deployment. For both the C2f_PIG and EUCB modules, their inclusion results in a degradation of real-time capability, primarily due to decreased inference speed. This decrease is an expected outcome resulting from the increased depth of the neural network. Importantly, PCPE-YOLO maintains an FPS above 90, indicating that the model achieves high performance while retaining strong real-time capabilities.
Table 7.
Results of FPS and latency.
| Model | P2 | CAA | C2f_PIG | EUCB | FPS | Latency(ms) | Preprocess(ms) | Inference(ms) | Postprocess(ms) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | | 133 | 7.5 | 0.2 | 4.4 | 2.9 |
| 2 | √ | | | | 100 | 10.0 | 0.2 | 6.4 | 3.4 |
| 3 | √ | √ | | | 97 | 10.3 | 0.2 | 6.4 | 3.7 |
| 4 | √ | √ | √ | | 95 | 10.6 | 0.2 | 8.0 | 2.4 |
| 5 | √ | √ | √ | √ | 91 | 11.0 | 0.2 | 8.3 | 2.5 |
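For readers who wish to reproduce such measurements, a generic single-image timing sketch is given below. It is not the exact protocol used here, which reports the preprocess, inference, and postprocess stages separately via the framework's profiler; batch size 1 and the CUDA device are assumptions.

```python
import time
import torch


@torch.no_grad()
def measure_latency(model, imgsz=640, warmup=20, runs=200, device="cuda"):
    """Average per-image forward latency (ms) and the corresponding FPS at batch size 1."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):                       # warm up kernels and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1e3
    return latency_ms, 1000.0 / latency_ms
```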
Visualization results
To better demonstrate the superiority of the proposed model, YOLOv10l, YOLO11l, and YOLO12l with relatively higher mAP50 scores are selected for comparison. Visualization results are shown in Fig. 4. PCPE-YOLO successfully identifies extremely small objects in the images and accurately distinguishes foreground objects from background regions, effectively leveraging the limited features of small targets.
Fig. 4.
Visualization results on the Visdrone2019 dataset.
To further highlight the model's attention to feature maps, heatmap visualizations of YOLOv8m and PCPE-YOLO are provided in Fig. 5, which were generated using Gradient-weighted Class Activation Mapping (Grad-CAM)53. In terms of attention to small objects, YOLOv8m shows clear shortcomings compared to PCPE-YOLO, struggling to focus on the feature regions of small targets. This is primarily due to the dense distribution of targets and complex backgrounds, which reduce the model's sensitivity to distant objects. After optimization, the proposed model exhibits significantly improved focus on small objects, reduces interference from environmental factors, and enhances its detection capability for small targets.
Fig. 5.
Visualization results on heatmap.
Discussion
Thanks to the dynamically reconfigurable C2f_PIG module, the model achieves a better balance between performance and lightweight design after replacing the original C2f modules. In small object detection tasks, the improved model demonstrates a significant accuracy improvement compared to the baseline, while reducing the number of parameters by 24.7%, effectively achieving a trade-off between precision and lightweightness. The proposed PCPE-YOLO model is thoroughly evaluated on three different datasets, and the results show that it achieves the best detection precision across all datasets. It should be noted that the current study primarily focuses on improving detection accuracy through architectural optimizations, such as the integration of lightweight components and attention mechanisms. However, it has not yet systematically explored the use of data augmentation or data expansion techniques to further address the challenges posed by small object occlusion and dense object distributions.
The VisDrone2019 dataset used in this study presents a high level of challenge, as a single image often contains hundreds of objects, leading to severe occlusion and overlapping. Excessive occlusion limits the model's ability to learn effective object features, thereby affecting detection performance. To mitigate this issue, future work will explore the integration of diffusion models to improve and augment the dataset. By modeling image features with diffusion models, it is possible to maintain data diversity while controlling the number of objects and the degree of occlusion in the generated images, for example through text prompts or layout guidance for fine-grained control. Such diffusion-generated samples are expected to further improve detection precision in downstream tasks and enhance the generalization capability of the model.
Conclusion
A novel object detection algorithm named PCPE-YOLO is proposed in this paper for lightweight small object detection. First, a dynamically reconfigurable C2f structure is introduced and further improved to form the new C2f_PIG module. Then, the CAA module is integrated to enhance the model's ability to capture long-range contextual dependencies. A small object detection layer is added to better adapt the model to small object detection tasks, followed by the incorporation of the EUCB module to improve the model's upsampling capability. On the VisDrone2019 dataset, the C2f_PIG module demonstrates both feasibility and superiority. While reducing the parameter count by approximately 24.7%, PCPE-YOLO achieves improvements of 3.8% in precision, 5.6% in recall, 6.2% in mAP50, and 5% in F1 score compared to the original model. Compared with other models, PCPE-YOLO also exhibits strong overall performance. Generalization experiments on the KITTI and NWPU VHR-10 datasets show that PCPE-YOLO significantly outperforms the baseline model across multiple metrics, demonstrating good generalization ability. In summary, PCPE-YOLO provides substantial advantages in both small object detection and lightweight design, effectively balancing performance and the number of parameters.
Author contributions
Conceptualization, W.J.C.; methodology, W.J.C.; investigation, W.J.C. and T.L.; validation, W.J.C. and J.M.L.; formal analysis, W.J.C. and J.M.L.; data curation, T.L.; writing-original draft preparation, W.J.C. and T.L; writing-review and editing, Y.M.Z., and J.M.L.; visualization, W.J.C.; supervision, Y.M.Z. and J.M.L.; project administration, Y.M.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported in part by the National Natural Science Foundation of China (62403108, 42301256), the Aeronautical Science Foundation of China (20240001042002), the Fundamental Research Funds for the Central Universities (N2426005), the Ministry of Industry and Information Technology Project (TC220H05X-04), the Liaoning Provincial Natural Science Foundation Joint Fund (2023-MSBA-075), the Scientific Research Foundation of Liaoning Provincial Education Department (LJKQR20222509).
Data availability
The project code has been uploaded to Github (https://github.com/WeijiaChen123/PCPE-YOLO). The data generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Redmon, J. You only look once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
- 2. Liu, W. et al. SSD: Single shot multibox detector. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing, 21–37 (2016).
- 3. Girshick, R. Fast R-CNN. Preprint at arXiv:1504.08083 (2015).
- 4. Ren, S. et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), 1137–1149 (2016).
- 5. He, K. et al. Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision 2961–2969 (2017).
- 6. Jiang, P. et al. A review of YOLO algorithm developments. Procedia Comput. Sci. 199, 1066–1073 (2022).
- 7. Carion, N. et al. End-to-end object detection with transformers. European Conference on Computer Vision. Cham: Springer International Publishing, 213–229 (2020).
- 8. Redmon, J. & Farhadi, A. YOLO9000: better, faster, stronger. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7263–7271 (2017).
- 9. Redmon, J. Yolov3: An incremental improvement. Preprint at arXiv:1804.02767 (2018).
- 10. Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. Yolov4: Optimal speed and accuracy of object detection. Preprint at arXiv:2004.10934 (2020).
- 11. Jocher, G. YOLOv5 [EB/OL]. (2020). https://github.com/ultralytics/yolov5
- 12. Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. Preprint at arXiv:2209.02976 (2022).
- 13. Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7464–7475 (2023).
- 14. Jocher, G., Chaurasia, A. & Qiu, J. Ultralytics YOLOv8 [EB/OL]. (2023). https://github.com/ultralytics/ultralytics
- 15. Wang, C. Y., Yeh, I. H. & Mark Liao, H. Y. Yolov9: Learning what you want to learn using programmable gradient information. European Conference on Computer Vision. Cham: Springer Nature Switzerland, 1–21 (2024).
- 16. Wang, A. et al. Yolov10: Real-time end-to-end object detection. Preprint at arXiv:2405.14458 (2024).
- 17. Khanam, R. & Hussain, M. Yolov11: An overview of the key architectural enhancements. Preprint at arXiv:2410.17725 (2024).
- 18. Tian, Y., Ye, Q. & Doermann, D. Yolov12: Attention-centric real-time object detectors. Preprint at arXiv:2502.12524 (2025).
- 19. Zhao, Y. et al. DETRs beat YOLOs on real-time object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 16965–16974 (2024).
- 20. Lin, T. Y. et al. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2117–2125 (2017).
- 21. Liu, S. et al. Path aggregation network for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 8759–8768 (2018).
- 22. Zheng, Z. et al. Distance-IoU loss: Faster and better learning for bounding box regression. Proceedings of the AAAI Conference on Artificial Intelligence 34 (07), 12993–13000 (2020).
- 23. Wu, J. et al. A lightweight small object detection method based on multi-layer coordination federated intelligence for coal mine IoVT. IEEE Internet of Things Journal (2024).
- 24. Pan, W. & Yang, Z. A lightweight enhanced YOLOv8 algorithm for detecting small objects in UAV aerial photography. Visual Comput., 1–17 (2025).
- 25. Lei, P., Wang, C. & Liu, P. RPS-YOLO: A recursive pyramid structure-based YOLO network for small object detection in unmanned aerial vehicle scenarios. Appl. Sci. 15 (4), 2039 (2025).
- 26. Wang, H. et al. YOLOv8-QSD: An improved small object detection algorithm for autonomous vehicles based on YOLOv8. IEEE Transactions on Instrumentation and Measurement (2024).
- 27. Tahir, N. U. A. et al. PVswin-YOLOv8s: UAV-based pedestrian and vehicle detection for traffic management in smart cities using improved YOLOv8. Drones 8 (3), 84 (2024).
- 28. Bao, D. & Gao, R. YED-YOLO: An object detection algorithm for automatic driving. Signal Image Video Process. 18 (10), 7211–7219 (2024).
- 29. Im Choi, J. & Tian, Q. Saliency and location aware pruning of deep visual detectors for autonomous driving. Neurocomputing 611, 128656 (2025).
- 30. Cao, J. et al. GCL-YOLO: A GhostConv-based lightweight YOLO network for UAV small object detection. Remote Sens. 15 (20), 4932 (2023).
- 31. Liu, X. et al. YOLOv8-FDD: A real-time vehicle detection method based on improved YOLOv8. IEEE Access (2024).
- 32. Shen, J. et al. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention fine-grained features. IEEE Trans. Instrum. Meas. 73, 1–16 (2024).
- 33. Than, P. M., Ha, C. K. & Nguyen, H. Long-range feature aggregation and occlusion-aware attention for robust autonomous driving detection. SIViP 19, 738 (2025).
- 34. Nguyen, H. et al. Enhanced object recognition from remote sensing images based on hybrid convolution and transformer structure. Earth Sci. Inf. 18, 228 (2025).
- 35. Ha, C. K., Nguyen, H. & Van, V. D. An optimized convolutional architecture for robust ship detection in SAR imagery. Intell. Syst. Appl. 26, 200538 (2025).
- 36. Shen, J. et al. An algorithm based on lightweight semantic features for ancient mural element object detection. Npj Herit. Sci. 13 (1), 70 (2025).
- 37. Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13 (2021).
- 38. Sapkota, R. et al. YOLOv10 to its genesis: A decadal and comprehensive review of the you only look once (YOLO) series. Preprint at arXiv:2406.19407 (2024).
- 39. Chen, J. et al. Run, don't walk: Chasing higher FLOPS for faster neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 12021–12031 (2023).
- 40. Yu, W. et al. InceptionNeXt: When Inception meets ConvNeXt. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 5672–5683 (2024).
- 41. Tang, Y. et al. GhostNetV2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 35, 9969–9982 (2022).
- 42. Cai, X. et al. Poly kernel inception network for remote sensing detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 27706–27716 (2024).
- 43. Rahman, M. M., Munir, M. & Marculescu, R. EMCAD: Efficient multi-scale convolutional attention decoding for medical image segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11769–11779 (2024).
- 44. Du, D. et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (2019).
- 45. Geiger, A. et al. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 32 (11), 1231–1237 (2013).
- 46. Cheng, G., Zhou, P. & Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 54 (12), 7405–7415 (2016).
- 47. Finder, S. E. et al. Wavelet convolutions for large receptive fields. European Conference on Computer Vision (Springer Nature, 2024).
- 48. Yang, J. et al. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. Proceedings of the AAAI Conference on Artificial Intelligence 39 (9), 9202–9210 (2025).
- 49. Qiao, S., Chen, L. C. & Yuille, A. DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 10213–10224 (2021).
- 50. Li, C., Zhou, A. & Yao, A. Omni-dimensional dynamic convolution. Preprint at arXiv:2209.07947 (2022).
- 51. Han, K. et al. GhostNets on heterogeneous devices via cheap operations. Int. J. Comput. Vision 130 (4), 1050–1069 (2022).
- 52. Lv, D. et al. GS-YOLO: A lightweight SAR ship detection model based on enhanced GhostNetV2 and SE attention mechanism. IEEE Access (2024).
- 53. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision 618–626 (2017).