Skip to main content
PLOS One logoLink to PLOS One
. 2026 Feb 20;21(2):e0342084. doi: 10.1371/journal.pone.0342084

FRCP-YOLO: Road object detection algorithm based on improved YOLOv8n

Dongmei Liu 1, Changchun Wang 1,*, Xuejun Li 1, Xiguo Zhao 1, Chuanli Yin 2, Yuchi Liu 1, Shuai Li 1, Xuan Li 1
Editor: Ankit Gupta3
PMCID: PMC12923005  PMID: 41719268

Abstract

The accuracy of road object detection is crucial for ensuring the safe driving of autonomous vehicles. Challenges such as small object missed detection, excessive parameters, low accuracy, and poor robustness are commonly observed in current road object detection models. To address the above problems, a road object detection model named FRCP-YOLO is proposed in the present study, which is developed based on the YOLOv8n. Firstly, to reduce the model’s parameters and complexity, the C2f module in the backbone network is replaced with a lightweight FasterNet Block, which enhancing the speed of image feature extraction; then the proposed R-CA module, which is based on a residual block with the Coordinate Attention (CA) mechanism, is introduced to enhance the model’s focus on objects of interest and improve its feature-learning capability. Secondly, to enhance small object detection performance, a high-resolution branch for feature extraction and a detection head for processing these features are introduced, thereby improving the model’s robustness. Finally, PIoU v2 is selected as the bounding box regression loss function to effectively prevent anchor box enlargement, enhance the ability to focus on anchor boxes, and further improve overall detection accuracy. Based on the KITTI dataset, the comparison experiments between FRCP-YOLO and other mainstream algorithms were carried out, FRCP-YOLO achieves object detection accuracies of 0.924 and 0.667 (in terms of mAP@50 and mAP@50–95) on the test set, representing improvements of 5.0% and 6.6% over the baseline model, while reducing parameters by 4%. Comparative experiments were conducted on the BDD100K dataset of complex road scenes. The detection accuracy of FRCP-YOLO outperforms other mainstream algorithms in challenging scenarios, such as dense traffic, occlusions, and night conditions, which verifies the generalization of FRCP-YOLO, highlighting its reliability and effective object detection capabilities in complex scenarios.

1 Introduction

As autonomous driving technology advances rapidly [13], research on road object detection, which perceives the surrounding environment [4,5], traffic conditions [6,7], and supports accurate driving decisions, has gained prominence. Road object detection aims to accurately perceive the external environment, especially to accurately identify and locate various objects in complex road environments such as dense occlusion, night and rainy days [8]. By detecting vehicles [9,10], pedestrians [11,12], lane lines [13,14], etc. on the road, potential dangers can be discovered in time, thereby achieving early warning, reducing the risk of traffic accidents [15], and improving the driving experience. As a core component of the autonomous driving system, the detection accuracy of road object detection technology is crucial for ensuring vehicle safety. Accurate road object detection can build a more intelligent and efficient traffic management system [16], and has significant practical value in improving road capacity and safety, optimizing traffic flow management, and improving the level of intelligent driving.

Road object detection algorithm based on deep learning has attracted much attention from researchers for its accuracy and efficiency. Currently, it is mainly exploring and improving multiple directions such as loss function, small object feature extraction method, network architecture, etc., and is committed to further improving the performance of object detection. Yang et al. [17] embedded a deformable convolution network (DCN) in the C2f module of the Backbone network to improve the model’s feature extraction ability under complex background conditions. And they added a global attention mechanism (GAM) to the Neck network to highlight important feature information. Wang et al. [18] embedded the BiFormer attention mechanism in the Neck layer of the original YOLOv8n model to effectively capture the association and dependency between features. Meanwhile, WIoU v3 was used as the loss function to enhance the model’s focus on high-quality anchor boxes. Dang et al. [19] added an attention module to the YOLOv5 backbone network to suppress non-critical information and used CIoU as the loss function to improve vehicle detection accuracy. Chen et al. [20] improved the detection accuracy of small objects by adding a small object detection head to the YOLOX network, and used the deep separable convolution method to optimize the Neck network, further improving the model’s computational efficiency. LI et al. [21] introduced spatial deep convolution (SPD-Conv) to replace the original convolution layer to solve the problem of small object detection in the network; in order to make the network pay more attention to the key information in the feature map, the NAM module was added after the C3 module in the backbone network. Liu et al. [22] proposed a high-resolution detection network (HRDNet) for detecting small objects, in which high-resolution input will be sent to a shallow network to retain more position information, and low-resolution input will be sent to a deep network to extract more semantics. Liu et al. [23] proposed a feature sharing detection head to reduce the redundant parameters of the model; the DWR module was used to enhance the feature extraction capability of the backbone network. Wang et al. [24] proposed an improved feature pyramid network AF-FPN, which uses an adaptive attention module (AAM) and a feature enhancement module (FEM) to reduce information loss in the feature map generation process and enhance the representation capability of the feature pyramid. Zhang et al. [25] reconstructed the YOLOv7 backbone network using the Res3Unit structure to improve the model’s ability to obtain more nonlinear features, and added a mixed attention mechanism module ACmix after the SPPCSPC layer to enhance the model’s attention to vehicles. The above improved algorithms have achieved better results in the road object detection task, but there is still room for improvement in the detection of small objects, the number of model parameters, and the detection accuracy, while the robustness in complex traffic scenarios needs to be improved.

The main contributions of the present study are as follows:

  • 1)

    To effectively reduce model’s parameters and computational complexity while improving the computational efficiency, the lightweight and fast FasterNet Block is utilized for road object feature extraction. Furthermore, to effectively overcome the model gradient disappearance and performance degradation problems, enhance the detection effect of the object of interest, and improve the detection accuracy, an efficient R-CA module is proposed by combining the residual block in the residual network with the CA mechanism.

  • 2)

    To enhance small object detection and mitigate missed detections, the small object feature extraction network within the Neck network is improved. Specifically, a dedicated branch for capturing small object feature is incorporated. Additionally, a specialized detection head for small object feature processing is integrated into the Head network, thereby boosting model performance in complex road scenes and overall detection accuracy.

  • 3)

    PIoU v2 is used as the bounding box regression loss function, which enhances the bounding box regression speed and improves the road object positioning accuracy of the algorithm. Incorporating a penalty factor into PIoU v2 effectively prevents anchor boxes from becoming excessively large, while the introduction of a non-monotonic attention function further improves focus on anchor boxes.

2 Related work

2.1 YOLOv8n

YOLOv8 [26] achieves precise positioning and detection of objects by optimizing network architecture, lightweight design, and utilizing multiscale feature fusion. It is well-suited for autonomous driving applications scenarios that demand real-time performance and high accuracy, and it is straightforward to use and deploy. According to the model’s parameters and performance, YOLOv8 is divided into five versions, among which YOLOv8n is the lightest and fastest network. Due to the limited computing resources of autonomous driving on-board equipment, the object detection model used is required to be both lightweight and fast, YOLOv8n perfectly meets this requirement; it has achieved excellent detection accuracy on public datasets and is suitable for detecting different road object scenarios. Therefore, YOLOv8n is selected as the basic model, which consists of three parts: Backbone, Neck and Head network. The overall structure is shown in Fig 1.

Fig 1. YOLOv8n model structure diagram.

Fig 1

2.1.1 Backbone network.

The backbone network of YOLOv8n adopts a complex convolutional neural network primarily for image feature extraction. It replaces the C3 module with the C2f module based on the YOLOv5 [27] network, making the gradient flow information richer and improving the convergence speed of the model, the C2f structure is shown in Fig 1. Moreover, it is combined with convolutional layers to extract high-level semantic features from the input image [28], further enhancing the network’s feature extraction capability.

2.1.2 Neck network.

The Neck network serves as a crucial link between Backbone and Head networks, comprising the SPPF layer and an improved Path Aggregation Network (PANet) [29]. Its primary function is to fuse the multi-scale features extracted by the Backbone, which is essential for enhancing the overall object detection performance. Specifically, the SPPF layer is used to process features at different scales, which not only improves the feature expression capability, but also speeds up the inference speed of the model compared with SPP. The PANet structure comprises two components: the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). FPN adopts top-down method to connect deep and shallow feature maps, while PAN adopts bottom-up feature pyramid network structure, which makes it easier to transfer the bottom information to the top of the upper level, and fully integrates multiscale features.

2.1.3 Head network.

The Head network includes a total of three detection heads corresponding to the feature maps of three different scales in the Neck network, which are mainly responsible for processing the extracted feature maps and generating the final prediction results, including the bounding box coordinates, the object confidence scores, and the category labels. The loss function part uses Binary Cross Entropy Loss (BCE Loss) as the classification loss, and Distribution Focal Loss (DFL) and Complete Intersection over Union (CIoU) [30] Loss as the regression loss. The Head module of YOLOv8 is improved compared with YOLOv5: 1) Improvement of the Coupled-Head structure to Decoupled-Head structure to extract position information and category information respectively; 2) Meanwhile, from the Anchor-Based approach to the Anchor-Free approach, the center point and width-to-height ratio of the object are predicted directly. The above improvements enhance the detection speed and accuracy of the YOLOv8 network model.

3 FRCP-YOLO

Although the YOLOv8n network exhibits satisfactory performance in general object detection tasks, it still suffers from missed and false detections of small objects, insufficient focus capability, and limited detection accuracy in road object detection for autonomous driving scenarios. To address these issues, an improved detection framework named FRCP-YOLO is proposed, as illustrated in Fig 2. The specific improvements over the YOLOv8n network are described in Sections 3.1–3.3.

Fig 2. FRCP-YOLO network structure.

Fig 2

3.1 Improvement of the FRCP-YOLO backbone network

This section provides an in-depth analysis of the primary components of the FRCP-YOLO backbone network, including the FasterNet Block and R-CA module, and presents a comparative analysis of the backbone structure prior to and following the improvements, as shown in Fig. 3. FasterNet Block is introduced to reduce the number of model’s parameters and improve computational efficiency. The R-CA module is proposed to address gradient disappearance in model training and enhance the model’s learning ability.

Fig 3. Structural comparison of backbone networks.

Fig 3

3.1.1 FasterNet block.

The C2f convolution module in the YOLOv8n network significantly improves the model’s feature extraction capability by increasing the number of Bottleneck modules, but also increases the number of parameters. Meanwhile, the feature maps between different channels are highly similar, generating redundant channel information, causing the model to frequently access memory during operation and increase redundant information calculation. To solve the above problems, the FasterNet Block is adopted, which is designed with Partial Convolution (PConv) and Pointwise Convolution (PWConv) as proposed in the FasterNet [31] neural network. The structure is shown in Fig 4.

Fig 4. FasterNet Block structure diagram.

Fig 4

Lightweight networks such as MobileNets, ShuffleNets, and GhostNet, primarily utilize depthwise convolutions (DWConv) and group convolutions (GConv) for feature extraction. While these methods have achieved notable reductions in model complexity, they often lead to frequent memory accesses and increased computational overhead [31]. In contrast to these conventional lightweight convolutions, PConv applies convolution for feature extraction on only a part of channels Cp of the input feature map Ih×w×c, and leaving the remaining channels untouched, subtly reduces the computation of redundant feature maps in the channels, largely reduces memory accesses, and improves the detection speed of the model. To more effectively utilize the information of all channels, PWConv is introduced, and the overall receptive field is like a T-shaped convolution, which enhances the model’s attention to the central area of the image. The structure is shown in Fig 5.

Fig 5. PConv + PWConv illustration.

Fig 5

Compared with modules such as C2f, FasterNet Block not only solves the problem of frequent memory accesses and channel redundant information calculation of the model, but also reduces model complexity, improves training efficiency. It achieves a better balance between parameter count and detection accuracy, thereby enabling more efficient spatial feature extraction.

3.1.2 R-CA module.

Attention mechanism is one of the common methods to improve the performance of deep neural networks. The widely adopted Squeeze-and-Excitation attention mechanism [32] (SE) reduces computational cost; however, it solely considers channel information and overlooks crucial positional information. In contrast, the CBAM attention mechanism [33] incorporates positional information but is limited to capturing local relationships and involves a large number of parameters. Coordinate Attention (CA) [34] is a lightweight attention mechanism suitable for mobile devices, which enhances accuracy by integrating positional information into channel attention, as illustrated in Fig 6.

Fig 6. CA structure diagram.

Fig 6

Specifically, for any input X=[x1,x2,,xc]𝔽H×W×C, where H and W represent the height and width of the input feature map, and the position information is accurately encoded along the horizontal and vertical directions using two pooling kernels of sizes (H, 1) and (1, W) for channel xc, respectively, and the attention feature maps in the two directions are obtained. The outputs of the channel xc are shown in Eqs. (1) and (2), respectively:

Zch(h)=1W0iWxc(h,i) (1)
Zcw(w)=1H0jHxc(j,w) (2)

To fully utilize the obtained information, the above two feature maps are concatenated and convolved to produce an intermediate feature map, as shown in Eq. (3):

f=δ(F1([Zh,Zw])) (3)

Where [*,*] denotes the concatenation operation along the spatial dimension, δ is the nonlinear activation function, and F1 is the convolutional transformation function. The tensor f is then divided into two separate tensors fw and fw along the spatial dimension, followed by the application of convolutional transformation and nonlinear activation, respectively, as shown in Eqs. (4) and (5):

gh=σ(Fh(fh)) (4)
gw=σ(Fw(fw)) (5)

Where σ is the sigmoid function, Fh and Fw perform convolutional transformations on fh and fw respectively, gh and gw are used as the attention weights in the horizontal and vertical directions. The output of the CA mechanism is shown in Eq (6):

YCA=xc(i,j)×gch(i)×gcw(j) (6)

This demonstrates that the CA mechanism captures long-range dependencies along one spatial direction while preserving precise positional information along the other, thereby enhancing the model’s feature extraction capabilities.

As the number of network layers increases, the model encounters issues such as gradient disappearance, which leads to difficult training of the network and reduced model performance. Therefore, the residual block from residual networks [35] is adopted to effectively address the deterioration in model performance caused by an increase in network layers. Combining the CA mechanism and the residual block, the R-CA module is proposed, the structure is shown in Fig 7, and the expressions of R-CA are shown in (9).

Fig 7. R-CA structure diagram.

Fig 7

M=fconv1×1{fconv3×3[fconv1×1(F)]} (7)
K=Mc(i,j)×gch(i)×gcw(j) (8)
Y=K+F (9)

Where F represents the input feature map. fconv1×1 and fconv3×3 denote standard convolution operations with kernel sizes of 1 × 1 and 3 × 3, respectively. M is the output feature map after standard convolution. K is the feature map processed by the CA mechanism, and Y is the final output feature map.

Incorporating the CA mechanism within the residual block combines the strengths of both modules, enabling the model to focus on critical image regions and enhance its recognition accuracy. The CA mechanism can make the gradient flow propagate better in the model by weighting and adjusting the features. Simultaneously, the residual connection allows the network to effectively capture the difference between input and output, reinforce shallow feature learning, achieve more comprehensive learning of the object features in the image, and enhance the model’s generalization capability.

3.2 Improving small object feature extraction network

When YOLOv8 model detects road objects of varying sizes, due to the fact that small objects are easy to be occluded and confused and the feature map area is small, leading to increased chances of missed detection of small objects. This issue is particularly prevalent in complex road scenes containing small objects. To address the above problems and improve the model’s small object detection capability, a dedicated network for small object feature extraction is introduced, as highlighted within the green dotted box in Fig 2. First, the PANet structure in the Neck network is improved by adding an additional branch for small object feature extraction to capture finer details in the image. The number of branches is increased from three to four, resulting in a small object feature map with a high-resolution of 160 × 160 and rich details. Additionally, a detection head is added to the original Head network to process shallow feature maps with more detailed information, further improving the model’s ability to detect small objects.

Aiming at the problems of YOLOv8 model in small object detection, the small object feature extraction network is improved, so that the model more effectively fuses multiscale features, effectively reduces the loss of object features, comprehensively and accurately detects objects of various sizes, and at the same time can cope with the interference of the complex scene such as light change and object occlusion, which improves the robustness of the model.

3.3 Improving the bounding box regression loss function

The bounding box regression loss function quantifies the discrepancy between predicted and ground truth bounding boxes, serving as a pivotal component in the object detection algorithm. A lower loss value indicates that predicted box is closer to ground truth box, which results in better detection performance by the model.

The Complete Intersection over Union (CIoU) [30] bounding box regression loss function is employed in YOLOv8n, which comprehensively accounts for the distance, overlap area, and aspect ratio between bounding boxes. Fig 8 illustrates the intersection between the predicted and ground truth boxes, and Equation (10) present the corresponding calculation formulas.

Fig 8. CIoU illustration.

Fig 8

LCIoU=LIoU+ρ2(bp,bg)c2+4π2(arctanwghgarctanwphp)2α (10)
α=v1LIoU (11)

Where LIoU is the intersection loss of the predicted and ground truth boxes, bp and bg denote the centroids of the prediction and ground truth boxes, respectively, c represents the diagonal length of the minimum outer rectangle, and ρ2(bp,bg) is the distance between the centroids of the two boxes; wp, hp and wg, hg are the widths and heights of the predicted and ground truth boxes; and v represents the correction factor, while α denotes the weighting coefficients.

Since the CIoU loss function involves multi-class operations, requires substantial computation time and resources, and exhibits insufficient focus on small objects, it fails to meet the requirements for fast and accurate road object detection in autonomous driving scenarios. Therefore, the efficient Powerful-IoU v2 (PIoU v2) [36] loss function is introduced. The computational formula LPIoU is shown in equation (12), and the intersection schematic between predicted and ground truth boxes is shown in Fig 9.

Fig 9. PIoU illustration.

Fig 9

LPIoU=LIoU+1ep2 (12)
P=14(dw1wg+dw2wg+dh1hg+dh2hg) (13)

Where P is the penalty factor, dw1 and dw2, dh1 and dh2 represent the absolute distances between the corresponding sides of the predicted and ground truth boxes.

Existing bounding box regression loss functions, such as CIoU, EIoU, and SIoU, suffer from the use of inappropriate penalty factors, which lead to enlarged anchor boxes, slower convergence, and insufficient consideration of object scale [36]. To overcome these limitations, a penalty factor q is incorporated into PIoU v2, which effectively avoids the oversized anchor box and steer the anchor box to return to the ground truth box through a direct and effective way, so that the anchor box more closely match the ground truth box. Additionally, a non-monotonic attention function u(x) is introduced to adaptively adjust gradients for the anchor box quality, thereby improving the focus on medium-quality and high-quality anchor boxes, and the formula for the PIoU v2 loss function is shown in the following equation:

q=ep,q[0,1] (14)
u(x)=3x·ex2 (15)
LPIoUv2=u(λq)·LPIoU (16)
LPIoUv2=3u(λq)·e(λq)2·LPIoU (17)

In this case, the penalty factor p is replaced by q, with the hyperparameter λ set to 1.3. This modification enhances the convergence speed and detection accuracy of the PIoU v2 bounding box regression loss function.

4 Experimental results and analysis

4.1 Dataset

The autonomous driving dataset KITTI [37], jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute of America, is used in this experiment. It contains a large number of overlapping and small objects, covering a variety of road scenarios such as countryside, city center as well as highway, with a total of Tram, Truck, Car, Person sitting, Pedestrian, Van, Cyclist, Misc, and DontCare 9 categories. After comprehensively considering the actual application of the autonomous driving system, Car, Tram, Truck, and Van are uniformly classified as Car, Person sitting and Pedestrian are uniformly classified as Pedestrian, Cyclist is retained, Misc and DontCare are deleted, and a total of 3 categories (Car, Pedestrian, and Cyclist) are set. Since the test set images in KITTI do not have labeled data, this experiment only uses the labeled training set images, a total of 7481 images, and randomly divides this dataset into the training, validation, and test sets according to the ratio of 8:1:1.

4.2 Experimental environment and parameter settings

The platform configuration for this experiment is presented in Table 1. All models were trained for 200 epochs with a batch size of 16 samples. The input image size was set to 640 × 640, the learning rate to 0.01, momentum to 0.937, and the weight decay coefficient to 0.0005.

Table 1. Experimental platform configuration.

Configuration Parameter
CPU 12th Gen Intel(R) Core(TM) i5-12600KF
GPU NVIDIA GeForce RTX 3060
Operating system Windows 10
Python 3.8.0
Accelerated environment CUDA 11.8, CUDNN 8.6.0
Development environment Pycharm 2023.2.6
Deep learning framework Pytorch 2.0.0

4.3 Evaluation metrics

The present study uses four key evaluation metrics of object detection to comprehensively evaluate the detection performance of different models, including recall (R), precision (P), mean average precision (mAP), and parameters.

The corresponding formula is presented below:

P=TPTP+FP
R=TPTP+FN
mAP=1Ni=1NAPi

Where True positives (TP) represents the number of correctly classified as positive examples, False positives (FP) represents the number of incorrectly classified as positive examples, False negatives (FN) represents the number of incorrectly classified as negative examples. Let APi denote the average precision of the ith category, and let N represent the total number of categories. The number of parameters is used to assess model complexity. As the number of parameters increases, the model becomes more complex and demands greater computational resources.

4.4 Experimental results

4.4.1 Ablation experiment.

To explore the impact of the proposed FasterNet Block, improved small object feature extraction network, and R-CA module on detection performance, the effectiveness of the improvements is verified one by one based on the baseline model. Five groups of ablation experiments are designed based on the KITTI dataset. The test results are shown in Table 2. Fig 10 shows the comparative relationship between the evaluation metrics of different ablation experiments.

Table 2. Ablation experiment results.
Model Method P R mAP@50 mAP@50–95
A B C
Model 1 0.916 0.774 0.874 0.601
Model 2 0.868 0.791 0.862 0.584
Model 3 0.907 0.835 0.901 0.64
Model 4 0.88 0.826 0.873 0.601
Model 5 0.915 0.844 0.911 0.66

Note: A denotes the lightweight and efficient FasterNet Block, B refers to the improved small object feature extraction network, C represents the efficient R-CA module, and model 1 is the YOLOv8n model.

Fig 10. Comparison of ablation experiment results.

Fig 10

In Model 2, the FasterNet Block is introduced to lighten the original Backbone network, which significantly reduces the complexity of the model, solves the problem of frequent access to memory and channel redundant information, and improves the training efficiency of the model, but the detection accuracy is affected to a certain extent. Model 3 enhances the small object feature extraction network from Model 2, which can be seen in Fig 10 improves all the evaluation metrics, especially R (from 0.791 to 0.835) and mAP@50–95 (from 0.584 to 0.64), which effectively reduces the small object missed detection on the road, and the detection performance has been substantially improved. Compared to Model 2, Model 4 incorporates the R-CA module, resulting in improvements across all evaluation metrics. Specifically, P and R increase by 1.2% and 3.5%, while mAP@50 and mAP@50–95 improve by 1.1% and 1.7%, respectively. These results suggest that the network places greater emphasis on key regions in the image and more effectively learns residual information. Finally, the three proposed improvements are integrated into Model 5. Compared with the original model YOLOv8n, P remains almost unchanged, while R, mAP@50 and mAP@50–95 are increased by 7.0%, 3.7% and 5.9% respectively, which significantly improves the road object detection performance and shows the effectiveness of detecting various object categories.

The ablation experiment results demonstrate that the FasterNet Block, improved small object feature extraction network, and R-CA module proposed in this study effectively improve the performance of the FRCP-YOLO model without any conflict.

4.4.2 Comparative evaluation of bounding box regression loss functions.

Using the Model 5 with better detection performance in section 4.4.1 as the baseline model, experiments were conducted on common loss functions under the same training conditions to study the impact of loss functions on the model’s object detection performance. Table 3 gives the object detection evaluation metrics of the PIoU v2 loss function and other bounding box loss functions on the KITTI dataset.

Table 3. Performance comparison of different loss functions.
Loss P R mAP@50 mAP@50–95
CIoU 0.915 0.844 0.911 0.66
SIOU 0.912 0.841 0.924 0.665
WIoU v1 0.875 0.86 0.915 0.656
WIoU v2 0.91 0.849 0.917 0.663
WIoU v3 0.902 0.865 0.924 0.661
EIoU 0.909 0.838 0.904 0.645
Focal IoU 0.789 0.682 0.782 0.567
PIoU v2 0.906 0.86 0.924 0.667

As shown in Table 3, the PIoU v2 loss function has better detection performance on Model 5. Compared with the loss function CIoU used by the original network YOLOv8n, R, mAP@50 and mAP@50–95 are increased by 1.6%, 1.3% and 0.7% respectively, and P is slightly reduced. In general, PIoU v2 improves the overall performance of the model better than other loss functions. It not only avoids the anchor box from enlargement, but also enhances the focusing ability of the anchor box. The experimental results show that the PIoU v2 loss function is effective in improving the performance of Model 5.

4.4.3 Comparative experiments of different mainstream algorithms.

To evaluate the effectiveness of the proposed algorithm in road object detection for autonomous driving scenarios, a comparative experiment was conducted using FRCP-YOLO, YOLOv8, and other mainstream object detection algorithms on the KITTI test set. The results are presented in Table 4 and Fig 11.

Table 4. Test indicators of different algorithms.
Algorithm P R mAP@50 mAP@50–95 Parameters/106
YOLOv3-tiny 0.889 0.731 0.834 0.508 8.671
YOLOv5n 0.892 0.799 0.877 0.566 1.763
YOLOv7-tiny 0.862 0.812 0.87 0.557 6.013
YOLOv8n 0.916 0.774 0.874 0.601 3.006
YOLOv9t 0.854 0.796 0.87 0.583 2.802
YOLOv10n 0.86 0.791 0.855 0.592 2.696
YOLOv11n 0.895 0.785 0.867 0.592 2.583
YOLOv13n 0.876 0.789 0.87 0.602 2.45
RT-DETR-l 0.849 0.809 0.881 0.593 31.99
RT-DETR-resnet50 0.876 0.807 0.894 0.618 41.94
FRCP-YOLO 0.906 0.86 0.924 0.667 2.966
Fig 11. Comparison of detection metrics of different algorithms.

Fig 11

Experimental results demonstrate that FRCP-YOLO achieves excellent detection performance while maintaining a lightweight architecture. With only 2.966M parameters, FRCP-YOLO is significantly smaller than larger models such as YOLOv3-tiny (8.671M) and YOLOv7-tiny (6.013M), yet it achieves the highest detection performance among them. Compared with lightweight models such as YOLOv5n, YOLOv9t, YOLOv10n, YOLOv11n, and YOLOv13n, FRCP-YOLO delivers superior detection accuracy while maintaining a compact design. Furthermore, compared with Transformer-based object detectors such as RT-DETR-l and RT-DETR-ResNet50, FRCP-YOLO demonstrates a clear advantage by achieving a substantial reduction in parameter count and computational cost while sustaining leading detection accuracy. Relative to the baseline YOLOv8n model, FRCP-YOLO reduces the number of parameters by 4% while significantly enhancing detection performance, with R increased by 8.6%, and mAP@50 and mAP@50–95 improved by 5.0% and 6.6%, respectively. Overall, these results clearly demonstrate that FRCP-YOLO effectively combines a lightweight design with robust and accurate detection capability for road object detection tasks.

4.5 Verification of the generalization ability of the FRCP-YOLO model

To further verify the generalization ability of FRCP-YOLO, road detection experiments were conducted on the BDD100K [38] dataset released by the Berkeley AI Lab (BAIR) in 2018 under various weather conditions (such as sunny, cloudy, rainy, etc.) and across different times of day and night. Considering that the number of images containing Motor and Bike objects in the BDD100K dataset is small and the characteristics of the two are similar, the two are classified as Two-Wheelers, and then Pedestrian, Car, Truck, Bus, and Two-Wheelers are used as the objects of road object detection. A random selection of 10,000 images from the BDD100K dataset is used to evaluate the model’s generalization ability, which is divided into the training set, validation set, and test set according to the ratio of 8:1:1. The performance of FRCP-YOLO was compared with other mainstream algorithms from Section 4.5 on this dataset for object detection, with the experimental results presented in Table 5 and Fig 12.

Table 5. Test indicators of different algorithms.

Algorithm P R mAP@50 mAP@50–95
YOLOv3-tiny 0.496 0.328 0.331 0.17
YOLOv5n 0.577 0.413 0.428 0.231
YOLOv7-tiny 0.576 0.452 0.476 0.266
YOLOv8n 0.565 0.393 0.447 0.267
YOLOv9t 0.536 0.454 0.469 0.279
YOLOv10n 0.55 0.397 0.43 0.25
YOLOv11n 0.544 0.416 0.444 0.262
YOLOv13n 0.485 0.428 0.425 0.251
RT-DETR-l 0.561 0.381 0.402 0.213
RT-DETR-resnet50 0.546 0.417 0.424 0.238
FRCP-YOLO 0.587 0.481 0.513 0.307

Fig 12. Comparison of detection metrics of different algorithms.

Fig 12

The experimental results show that the proposed road object detection model outperforms other network models across all metrics. Compared with the baseline model YOLOv8n, FRCP-YOLO has higher detection performance. Specifically, P increased by 2.2%, R increased by 8.8%, mAP@50 and mAP@50–95 increased by 6.6% and 4%, respectively. A large number of comparative experimental results show that the FRCP-YOLO can accurately perceive the external environment in complex road scenes and realize accurate detection of road objects, with strong adaptability and generalization ability.

5 Conclusion

Aiming at the problems of small object missed detection, large number of parameters, and low detection accuracy of existing models in complex road scenes, a lightweight and efficient road object detection model named FRCP-YOLO is proposed. Specifically, the lightweight and fast FasterNet Block is utilized to extract features from a part of the input channels, by cutting down model’s parameters and redundant computation; combined with the Coordinate Attention mechanism and residual block, an efficient R-CA module is proposed to enhance the detection of objects of interest and overcome the problems of gradient disappearance and performance degradation; the small object feature extraction network is improved to capture tiny details in the image, reduce the loss of object information, decrease the impact of complex road scenes, and improve the robustness of the model; PIoU v2 is introduced as the bounding box loss function to guides anchor boxes to regress along efficient paths, resulting in faster convergence and improved detection accuracy.

Experimental results show that FRCP-YOLO performs better in road object detection tasks based on the KITTI dataset and the BDD100K dataset containing complex road environments. Compared with the baseline model, its accuracy increased by 5.0% and 4% respectively (in terms of mAP@50), while R increased by 8.6% and 8.8%, and the number of parameters decreased by 4%. Compared with the other mainstream algorithms, FRCP-YOLO has achieved detection accuracy superior to the other mainstream algorithms while realizing lightweight, has better generalization, effectively solves the problems of small object missed detection, large number of parameters and low detection accuracy in complex road scenarios, and is suitable for object detection of autonomous driving vehicle in complex roads. This study evaluates the improved model from multiple dimensions such as detection accuracy and the number of parameters, and provides new ideas and methods for subsequent autonomous driving visual perception research. Despite these promising results, the present study still has certain limitations, and future work will focus on optimizing the model from the following two aspects:

  • (1)

    In the experiment, only major road objects such as pedestrians and motor vehicles were detected. In the future, road objects such as traffic lights and traffic signs will be included in the research scope.

  • (2)

    The next step will be to consider adding a semantic segmentation network based on the FRCP-YOLO model to complete lane line detection and realize the joint detection of road objects and lane lines for autonomous driving vehicles.

Data Availability

Link to Kitti dataset: https://www.cvlibs.net/datasets/kitti/ Link to BDD00k dataset: https://github.com/WCC0810/Minimal-dataset__BDD100K-subset/releases.

Funding Statement

This study was financially supported by the Jilin Provincial Enterprise “Science and Technology Innovation Commissioner (Deputy Chief Technology Officer)” Program in 2024 in the form of an award received by DL, CW, XL and XZ. No additional external funding was received for this study. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Cai Z, Chen R, Wu Z, Xue W. YOLOv8n-FAWL: Object Detection for Autonomous Driving Using YOLOv8 Network on Edge Devices. IEEE Access. 2024;12:158376–87. doi: 10.1109/access.2024.3480976 [DOI] [Google Scholar]
  • 2.Liang S, Wu H, Zhen L, Hua Q, Garg S, Kaddoum G, et al. Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles. IEEE Trans Intell Transport Syst. 2022;23(12):25345–60. doi: 10.1109/tits.2022.3158253 [DOI] [Google Scholar]
  • 3.Wang H, Liu C, Cai Y, Chen L, Li Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans Instrum Meas. 2024;73:1–16. doi: 10.1109/tim.2024.3379090 [DOI] [Google Scholar]
  • 4.Natan O, Miura J. DeepIPC: Deeply Integrated Perception and Control for an Autonomous Vehicle in Real Environments. IEEE Access. 2024;12:49590–601. doi: 10.1109/access.2024.3385122 [DOI] [Google Scholar]
  • 5.Wu W, Liu C, Zheng H. A panoramic driving perception fusion algorithm based on multi-task learning. PLoS One. 2024;19(6):e0304691. doi: 10.1371/journal.pone.0304691 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wang L, Law KLE, Lam C-T, Ng B, Ke W, Im M. Automatic Lane Discovery and Traffic Congestion Detection in a Real-Time Multi-Vehicle Tracking Systems. IEEE Access. 2024;12:161468–79. doi: 10.1109/access.2024.3483439 [DOI] [Google Scholar]
  • 7.Su G, Shu H. Traffic flow detection method based on improved SSD algorithm for intelligent transportation system. PLoS One. 2024;19(3):e0300214. doi: 10.1371/journal.pone.0300214 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Zhao J, Zhao W, Deng B, Wang Z, Zhang F, Zheng W, et al. Autonomous driving system: A comprehensive survey. Exp Syst Appl. 2024;242:122836. doi: 10.1016/j.eswa.2023.122836 [DOI] [Google Scholar]
  • 9.Chen C, Wang C, Liu B, He C, Cong L, Wan S. Edge Intelligence Empowered Vehicle Detection and Image Segmentation for Autonomous Vehicles. IEEE Trans Intell Transport Syst. 2023;24(11):13023–34. doi: 10.1109/tits.2022.3232153 [DOI] [Google Scholar]
  • 10.Kang L, Lu Z, Meng L, Gao Z. YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection. Exp Syst Appl. 2024;237:121209. doi: 10.1016/j.eswa.2023.121209 [DOI] [Google Scholar]
  • 11.Yin Y, Zhang Z, Wei L, Geng C, Ran H, Zhu H. Pedestrian detection algorithm integrating large kernel attention and YOLOV5 lightweight model. PLoS One. 2023;18(11):e0294865. doi: 10.1371/journal.pone.0294865 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Zhang X, Zhang X, Wang J, Ying J, Sheng Z, Yu H, et al. TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection. IEEE Trans Neural Netw Learn Syst. 2025;36(7):13276–90. doi: 10.1109/TNNLS.2024.3443455 [DOI] [PubMed] [Google Scholar]
  • 13.Deng T, Wu Y. Simultaneous vehicle and lane detection via MobileNetV3 in car following scene. PLoS One. 2022;17(3):e0264551. doi: 10.1371/journal.pone.0264551 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Liu J, Deng B, Zou C, Zeng B, Zhang W, Tan J. Multi-lane detection by combining line anchor and feature shift for urban traffic management. Eng Appl Artif Intell. 2023;123:106238. doi: 10.1016/j.engappai.2023.106238 [DOI] [Google Scholar]
  • 15.Muhammad K, Ullah A, Lloret J, Ser JD, de Albuquerque VHC. Deep Learning for Safe Autonomous Driving: Current Challenges and Future Directions. IEEE Trans Intell Transport Syst. 2021;22(7):4316–36. doi: 10.1109/tits.2020.3032227 [DOI] [Google Scholar]
  • 16.Liu Q, Liu Y, Lin D. Revolutionizing Target Detection in Intelligent Traffic Systems: YOLOv8-SnakeVision. Electronics. 2023;12(24):4970. doi: 10.3390/electronics12244970 [DOI] [Google Scholar]
  • 17.Yang B, Hu Z. Automatic driving target detection based on YOLOv8n improved algorithm. Control Eng China. 2025;32(12):2177–84. [Google Scholar]
  • 18.Wang B, Li Y-Y, Xu W, Wang H, Hu L. Vehicle–Pedestrian Detection Method Based on Improved YOLOv8. Electronics. 2024;13(11):2149. doi: 10.3390/electronics13112149 [DOI] [Google Scholar]
  • 19.Dong X, Yan S, Duan C. A lightweight vehicles detection network model based on YOLOv5. Eng Appl Artif Intell. 2022;113:104914. doi: 10.1016/j.engappai.2022.104914 [DOI] [Google Scholar]
  • 20.Chen S, Ni S, Tong L. Improved YOLOX Algorithm for Object Detection in Autonomous Driving Scenarios. J Comput Eng Appl. 2024;60(12):225–33. [Google Scholar]
  • 21.Li F, Zhao Y, Wei J, Li S, Shan Y. SNCE-YOLO: An Improved Target Detection Algorithm in Complex Road Scenes. IEEE Access. 2024;12:152138–51. doi: 10.1109/access.2024.3481642 [DOI] [Google Scholar]
  • 22.Liu Z, Gao G, Sun L, Fang Z. HRDNet: High-Resolution Detection Network for Small Objects. In: 2021 IEEE International Conference on Multimedia and Expo (ICME). 2021. 1–6. doi: 10.1109/icme51207.2021.9428241 [DOI] [Google Scholar]
  • 23.Liu X, Wang Y, Yu D, Yuan Z. YOLOv8-FDD: A Real-Time Vehicle Detection Method Based on Improved YOLOv8. IEEE Access. 2024;12:136280–96. doi: 10.1109/access.2024.3453298 [DOI] [Google Scholar]
  • 24.Wang J, Chen Y, Dong Z, Gao M. Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Comput Applic. 2022;35(10):7853–65. doi: 10.1007/s00521-022-08077-5 [DOI] [Google Scholar]
  • 25.Zhang Y, Sun Y, Wang Z, Jiang Y. YOLOv7-RAR for Urban Vehicle Detection. Sensors (Basel). 2023;23(4):1801. doi: 10.3390/s23041801 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.ULTRALYTICS. YOLOv8. Available from: https://github.com/ultralytics/ultralytics
  • 27.Khanam R, Hussain M. What is YOLOv5: A deep look into the internal features of the popular object detector. arxiv preprint. arxiv:2407.20892. [Google Scholar]
  • 28.Yaseen M. What is YOLOv8: An in-depth exploration of the internal features of the next-generation object detector. arxiv preprint. 2024. doi: arxiv:2408.15857 [Google Scholar]
  • 29.He K, Zhang X, Ren S, Sun J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16. doi: 10.1109/TPAMI.2015.2389824 [DOI] [PubMed] [Google Scholar]
  • 30.Zheng Z, Wang P, Liu W, Li J, Ye R, Ren D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI. 2020;34(07):12993–3000. doi: 10.1609/aaai.v34i07.6999 [DOI] [Google Scholar]
  • 31.Chen J, Kao SH, He H, Zhuo W, Wen S, Lee CH. Run, don’t walk: chasing higher FLOPS for faster neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 12021–31. [Google Scholar]
  • 32.Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018. p. 7132–41. doi: 10.1109/cvpr.2018.00745 [DOI] [Google Scholar]
  • 33.Woo S, Park J, Lee JY, Kweon IS. Cbam: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. p. 3–19. [Google Scholar]
  • 34.Hou Q, Zhou D, Feng J. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021. p. 13713–22. [Google Scholar]
  • 35.He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 770–8. doi: 10.1109/cvpr.2016.90 [DOI] [Google Scholar]
  • 36.Liu C, Wang K, Li Q, Zhao F, Zhao K, Ma H. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2024;170:276–84. doi: 10.1016/j.neunet.2023.11.041 [DOI] [PubMed] [Google Scholar]
  • 37.Geiger A, Lenz P, Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. 2012. p. 3354–61. doi: 10.1109/cvpr.2012.6248074 [DOI] [Google Scholar]
  • 38.Yu F, Chen H, Wang X, Xian W, Chen Y, Liu F, et al. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 2636–45. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Link to Kitti dataset: https://www.cvlibs.net/datasets/kitti/ Link to BDD00k dataset: https://github.com/WCC0810/Minimal-dataset__BDD100K-subset/releases.


Articles from PLOS One are provided here courtesy of PLOS

RESOURCES