Enhancing small target traffic sign detection with ML_SAP in YOLOv5s

Zhenguo Lu; Zhibo Zhu; Weipeng Xu; Guixian Li; Jinyang Chen

doi:10.1038/s41598-024-76804-0

2024 Oct 29;14:25904. doi: 10.1038/s41598-024-76804-0

PMCID: PMC11522521 PMID: 39472480

Abstract

Detecting small traffic signs poses significant challenges due to the complex nature and dynamic conditions of real-world traffic scenarios. In response to these challenges, we propose an improved YOLOv5s target detection model incorporating the multilevel squeeze feature perception (ML_SAP) mechanism, aiming to increase the accuracy of small traffic sign detection. First, an additional detection layer is incorporated to enhance the model’s capacity for detecting small-scale traffic signs. This improvement is accompanied by the adoption of a WIoU loss function, which evaluates the quality of anchor boxes. Moreover, the ML_SAP mechanism is designed to promote the fusion and extraction of features at different levels. This mechanism effectively increases the network model’s proficiency in identifying small targets under varying environmental conditions. To verify the effectiveness of the improved method, we conducted extensive experiments on two public traffic sign datasets. Notably, on the challenging samples in the CCTSDB-2021 dataset, the improved model achieves a detection recall of 77.1%, which is 5.4% higher than that of YOLOv5s, and a mean average precision (mAP) of 82.7%, which is 3.9% higher than that of the base model. Furthermore, the model achieves a detection recall of 91.3% on the TT100K dataset, which is 3.7% higher than the performance of YOLOv5s, and a mean average precision (mAP) of 91.5%, which is 4.6% higher than that of the base model.

Subject terms: Applied mathematics, Computer science

Introduction

In recent years, rapid advancements in artificial intelligence and machine vision have led to the widespread application of deep learning in driver assistance and autonomous driving. The traffic sign detection system plays a crucial role in providing real-time traffic information and supporting intelligent vehicle control systems, thereby increasing overall driving safety. Furthermore, research has increasingly focused on the recognition of traffic signs, which has emerged as a popular area of study. Currently, traffic sign detection methods can be broadly categorized into two groups: traditional algorithms and deep learning algorithms¹.

Traditional traffic sign recognition algorithms focus primarily on feature extraction and classification. This involves segmenting the color space², combining feature extraction methods based on the shape and edges of traffic signs, and achieving recognition through feature classification via classifiers. Early on, researchers concentrated on the distinctive shapes and vivid colors of traffic signs, leading to the development of numerous traditional detection methods based on handcrafted features by scholars interested in these characteristics³.

Nevertheless, challenges arise in the practical application of these methods. The development of feature extraction techniques requires considerable human resources. Furthermore, the inherent simplicity of these features renders them insufficiently robust for effective navigation in complex and dynamic traffic environments⁴.

Deep learning-based object detection algorithms are broadly categorized into two classes: one-stage detection algorithms and two-stage detection algorithms. Notable examples of the latter include R-CNN⁵, Fast R-CNN⁶, and Faster R-CNN⁷. Conversely, prominent one-stage detection techniques include the single shot multibox detector (SSD)⁸, RetinaNet⁹, and the You Only Look Once (YOLO) series¹⁰. Figure 1 shows the prediction process of YOLOv5.

Fig. 1 — The prediction process of YOLO.

Lim et al.¹¹ proposed a feature fusion structure and attention mechanism to enhance small target detection. This approach introduces a trade-off between accuracy and efficiency: the additional network modules inevitably increase computational complexity, leading to a decrease in processing speed. Building upon YOLOv7, Li et al.¹² introduced a hybrid module that combines self-attention and convolution to strengthen the feature extraction capabilities of the backbone network. In addition, they incorporated omni-dimensional dynamic convolution as a novel feature extraction module specifically for small target detection. However, this approach lacks the necessary adaptability to effectively handle traffic signs with significant variations in size. Wang et al.¹³ proposed a feature tight fusion structure that improves detection accuracy for small targets and enhances network feature extraction capability while maintaining a low parameter footprint. However, their approach focuses solely on deep feature fusion, potentially neglecting the valuable information contained in shallow layers that is crucial for small target detection.

Although improvements have been made in small traffic sign detection¹⁴,¹⁵, challenges remain. These include insufficient model feature extraction for real-world scenarios, inadequate retention of small target features, and inconsistency in image data quality. To address these limitations, we propose an improved YOLOv5s model based on the ML_SAP mechanism, which is specifically designed for the detection and recognition of small traffic signs. Our approach incorporates two key elements to enhance small target detection. First, we design a feature fusion structure that specifically introduces shallow information. This inclusion of shallow features complements deeper feature maps, allowing the model to focus on the finer details crucial for small target recognition. Second, we employ the WIoU loss function to address the challenge of data quality inconsistency. This robust loss function effectively balances the influence of images of varying quality, resulting in a reduction in regression errors. Our model demonstrates promising performance on various traffic sign datasets. Two key innovations are introduced here:

The proposed ML_SAP mechanism is integrated into the YOLOv5 neck component. This mechanism is designed to combine information from different levels and then apply squeeze processing, thereby enhancing the model’s ability to recognize key information.
The addition of a small target detection layer to the YOLOv5 network structure specifically addresses the characteristics of small traffic sign targets. This enhancement is combined with the WIoU¹⁶ loss function, which improves the network’s focus on high-quality anchor frames and contributes to overall improvements in detector performance and detection efficiency.

The following section outlines the overarching framework of the model, along with the enhancements made to facilitate small traffic sign target detection. The third section elaborates on the proposed ML_SAP mechanism, while the fourth section provides a comprehensive exposition of the experimental outcomes, accompanied by a thorough analysis.

Methodologies

Overview of methods

The proposed traffic sign detection model, as illustrated in Fig. 2, is specifically tailored for real-world road environments. Initially, the model uses a backbone architecture to extract pertinent features from the input image. A small target detection layer is then integrated into the neck segment to enhance the model’s ability to recognize small-scale targets efficiently. Further improvement is achieved through the devised ML_SAP mechanism, which facilitates the fusion of feature information across adjacent levels. Notably, the ML_SAP mechanism processes feature information using varying numbers of convolution kernels, enabling the accurate extraction of features related to small targets. Ultimately, the WIoU loss function is employed to reinforce the network’s focus on anchor frames of standard quality, thereby augmenting the overall detector performance and optimizing detection efficiency.

Improvement of the neck network

The standard YOLOv5 model employs a convolutional backbone that generates feature maps at various scales (C1-C5). The neck subsequently uses a feature pyramid network (FPN) and path aggregation network (PAN) structure to fuse feature information from these scales via concatenation, as shown in Fig. 3. However, this approach may lead to the loss of feature information for small targets, particularly in the later convolution layers, due to the shrinking feature area. To address this limitation, we propose an extension that incorporates additional detection layers and heads. These new components specifically fuse the C2 feature map from the backbone with an upsampled feature map obtained from the neck. This enhanced fusion strategy allows the model to better retain the characteristics of small targets, ultimately facilitating their detection in the final stage.
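The effect of the shrinking feature area can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a 640 × 640 input and the commonly used strides of 4, 8, 16 and 32 for levels C2 to C5 (neither is stated in the text), and the 16-pixel sign is a hypothetical example.

```python
# Assumptions (not stated in the paper): a 640x640 network input and the
# usual YOLOv5 strides of 4, 8, 16 and 32 for levels C2 to C5.
INPUT_SIZE = 640
STRIDES = {"C2 (added head)": 4, "C3": 8, "C4": 16, "C5": 32}


def grid_size(input_size: int, stride: int) -> int:
    """Side length, in cells, of the feature map produced at this stride."""
    return input_size // stride


def cells_covered(sign_px: float, stride: int) -> float:
    """Side length of a sign measured in feature-map cells at this stride."""
    return sign_px / stride


sign_px = 16  # hypothetical small traffic sign, 16 pixels wide
for name, stride in STRIDES.items():
    side = grid_size(INPUT_SIZE, stride)
    span = cells_covered(sign_px, stride)
    print(f"{name}: {side}x{side} grid, sign spans {span:.2f} cells")
```

Under these assumptions the sign covers several cells at the C2 level but less than one cell at C5, which is the motivation for fusing C2 into an additional detection head.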

Fig. 3 — The neck network structure of YOLOv5.

Improvement of the loss function

The loss function gauges the disparity between the network model’s predictions and the ground truth, serving as a crucial metric for target detection effectiveness. Over the evolution of YOLO versions, from intersection over union (IoU) in the first generation to complete-IoU (CIoU) in the latest version of YOLOv5, localization loss has been pivotal in determining model quality. In this study, the WIoU loss function is used as a substitute for the CIoU loss function. It incorporates a gradient gain allocation strategy that dynamically adapts to the quality of each anchor box. This strategy reduces the competitiveness of high-quality anchor boxes while simultaneously mitigating the harmful gradients produced by low-quality examples. Following the WIoU formulation¹⁶, it is defined as follows:

$$\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)$$

where $\beta$ is the outlier degree used to describe the quality of the anchor box, $L_{IoU}^{*}$ is the IoU loss of the anchor box treated as a constant, and $\overline{L_{IoU}}$ is its running mean.

$$L_{WIoU} = r \, R_{WIoU} \, L_{IoU}, \qquad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$$

where $r$ is the nonmonotonic focusing coefficient, $R_{WIoU}$ is the center-distance penalty term of WIoU, and $\alpha$ and $\delta$ are hyperparameters.
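A minimal sketch of this loss for a single predicted/ground-truth box pair is given below. It assumes the v3 form from the cited Wise-IoU paper, L = r · R_WIoU · L_IoU with β = L_IoU / mean(L_IoU) and r = β / (δ · α^(β − δ)). The defaults α = 1.9 and δ = 3 and the function names are assumptions for illustration, and the running mean is passed in by the caller rather than tracked during training.

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def focusing_coefficient(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(pred, target, mean_l_iou, alpha=1.9, delta=3.0):
    """WIoU-style loss for one box pair; mean_l_iou is the running mean of L_IoU."""
    l_iou = 1.0 - iou(pred, target)
    # Squared distance between the two box centers.
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    # Width and height of the smallest box enclosing both boxes.
    w_g = max(pred[2], target[2]) - min(pred[0], target[0])
    h_g = max(pred[3], target[3]) - min(pred[1], target[1])
    r_wiou = math.exp((dx ** 2 + dy ** 2) / (w_g ** 2 + h_g ** 2))
    beta = l_iou / mean_l_iou  # outlier degree of this anchor box
    return focusing_coefficient(beta, alpha, delta) * r_wiou * l_iou


print(wiou_v3((5, 0, 15, 10), (0, 0, 10, 10), mean_l_iou=0.5))
```

In a training framework the enclosing-box size and β would be excluded from gradient computation; that detail has no counterpart in this plain-Python version.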

Multilevel squeeze feature perception

Given that traffic signs, as the focal point of detection in this study, typically occupy only a fraction of the entire image, enhancing the model’s sensitivity to feature information pertinent to these smaller targets becomes imperative. Consequently, we propose the multilevel squeeze feature perception (ML_SAP) mechanism to amalgamate feature information across adjacent levels, thereby increasing detection efficiency for small targets.

The ML_SAP mechanism leverages global pooling and squeeze processing to emphasize cross-channel feature information while employing a residual structure to comprehensively utilize information related to small targets. As shown in Fig. 4, the output feature maps (F1-F4) correspond to different scales for both the FPN and PAN structures, and each is paired with the feature map of the corresponding scale obtained after upsampling. Initially, a concatenation operation is applied to feature maps of identical sizes, followed by global average pooling. To further strengthen the fusion of channel-wise feature information between adjacent layers and promote the modeling of global features, a sigmoid activation function is applied to the pooled feature map before element-wise multiplication with the original feature map.

Fig. 4 — The structure of multi-level squeeze feature perception.

To achieve dimensionality reduction of the fused feature information, the channel dimension is halved via a convolution operation. However, to retain informative features, the channel dimension is subsequently restored. This process can be interpreted as a feature selection step, as it emphasizes features with stronger responses in the convolution and reduces redundancy. Then, the final convolution with dimensionality reduction aims to identify features requiring greater focus from the two adjacent layers. The calculation formula for this module is as follows:

This mechanism facilitates the integration of feature information from adjacent levels within the network. It concurrently reduces redundant features, allowing the model to focus primarily on informative details. Importantly, it also retains features associated with small targets that require heightened attention. As a result, the mechanism achieves global feature modeling while preserving critical details for accurate detection.
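The gating steps described in this section (concatenation, global average pooling, sigmoid, channel-wise reweighting, then a channel-halving 1 × 1 convolution) can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the authors' implementation: feature maps are nested lists, the convolution kernel is a fixed hypothetical matrix rather than learned weights, and the residual connection and subsequent convolutions are omitted.

```python
import math

# A feature map is a list of channels; each channel is a list of rows (H x W).


def concat_channels(a, b):
    """Concatenate two feature maps of identical spatial size along channels."""
    return a + b


def global_avg_pool(fmap):
    """One scalar per channel: the mean over all spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(fmap):
    """Scale every channel by the sigmoid of its globally pooled value."""
    weights = [sigmoid(v) for v in global_avg_pool(fmap)]
    return [[[w * v for v in row] for row in ch] for ch, w in zip(fmap, weights)]


def conv1x1(fmap, kernel):
    """1x1 convolution; kernel[o][i] mixes input channel i into output channel o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(k * fmap[i][y][x] for i, k in enumerate(k_o))
              for x in range(w)] for y in range(h)] for k_o in kernel]


# Two hypothetical single-channel 2x2 maps from adjacent levels.
f_a = [[[1.0, 2.0], [3.0, 4.0]]]
f_b = [[[0.0, 1.0], [1.0, 0.0]]]
fused = channel_gate(concat_channels(f_a, f_b))
halved = conv1x1(fused, [[0.5, 0.5]])  # 2 channels -> 1 channel
print(halved)
```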

Experimental studies

Experimental environment and design

The experiments detailed in this article are conducted on the Windows 10 operating system, using the PyTorch 2.0.1 framework with CUDA 11.7. The hardware setup includes a 2080 Ti GPU with 11 GB of graphics memory.

Two different datasets are used in this study to verify the generality of the model. The first dataset, designated CCTSDB2021¹⁷, is a well-known traffic sign dataset crafted by Changsha University of Science and Technology, and it is a prominent resource within the field of traffic sign research in China. This dataset categorizes common traffic signs encountered in daily life into three primary classes: directional signs, prohibition signs, and warning signs. The training subset of CCTSDB2021 includes 16,354 images, while the validation subset comprises 1,500 images. The second dataset, known as the TT100K¹⁸ dataset, is an extensive collection of traffic sign images developed collaboratively by Tsinghua University and Tencent. This dataset aggregates 100,000 high-definition images, each with a resolution of 2048 × 2048 pixels. To address the imbalanced data distribution inherent in the TT100K dataset, preprocessing measures are implemented. Specifically, the dataset is refined by retaining the 45 categories with the highest number of annotations. Furthermore, signs prefixed with “i” are categorized as instruction signs, those beginning with “w” as warning signs, and those commencing with “p” as prohibitive signs. The detailed dataset division is presented in Table 1.

Table 1.

The 45 types of traffic signs used in the experiment.

Category	Sign name
Indicating signs	i2, i4, i5, il100, il60, il80, io, ip
Warning signs	w13, w32, w55, w57, w59, wo
Forbidden signs	p10, p11, p12, p19, p23, p26, p27, p3, p5, p6, pg, ph4, ph4.5, ph5, pl100, pl120, pl20, pl30, pl40, pl5, pl50, pl60, pl70, pl80, pm20, pm30, pm55, pn, pne, po, pr40
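The TT100K preprocessing described above (keep the most frequently annotated classes, then group signs by name prefix) can be sketched as follows. The function names and the toy label list are hypothetical; only the prefix rule and the top-45 cut-off come from the text.

```python
from collections import Counter

# Prefix rule from the text: "i" = instruction, "w" = warning, "p" = prohibitive.
PREFIX_TO_CATEGORY = {"i": "instruction", "w": "warning", "p": "prohibitive"}


def category_of(sign_name: str) -> str:
    """Map a TT100K sign name such as 'pl80' to its coarse category."""
    return PREFIX_TO_CATEGORY.get(sign_name[:1], "other")


def keep_top_classes(labels, k=45):
    """Keep only annotations whose class is among the k most frequent classes."""
    kept = {name for name, _ in Counter(labels).most_common(k)}
    return [label for label in labels if label in kept]


# Toy annotation list (hypothetical); k=2 stands in for the paper's k=45.
labels = ["pl80", "pl80", "w57", "il60", "il60", "il60"]
filtered = keep_top_classes(labels, k=2)
print(filtered, [category_of(name) for name in filtered])
```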

Experiment and effect of the detection model

In this study, a comparative evaluation of mean average precision (mAP) curves is conducted for several object detection models, including Faster-RCNN, SSD, YOLOv3¹⁹, YOLOv7-tiny²⁰, YOLOv5s, and the refined version of the YOLOv5s model proposed herein. The experimental results, depicted in Fig. 5, illustrate the performance of these models across different datasets. Tables 2 and 3 present specific data comparisons.

Fig. 5 — Comparison of detection performance on different datasets: (a) on the CCTSDB2021 dataset (b) on the TT100K dataset.

Table 2.

Performance comparison on CCTSDB2021 dataset.

Model	P (%)	R (%)	F1 (%)	FPS	mAP (%)	Warning (%)	Prohibitory (%)	Mandatory (%)
Faster-RCNN	40.5	84.5	55	27	75.2	80.6	69.9	75.3
SSD	92.6	41.4	57.3	150	79.8	82.6	76.6	80.3
YOLOv3	85.6	71.2	77.7	92	72.7	68.3	69.2	80.8
YOLOv5s	88.3	71.7	79.1	156	78.8	70.6	81.8	83.9
YOLOv7-tiny	84.3	63.6	72.5	91	73.0	69.7	74.0	75.3
Ours	89.4	76.4	82.4	125	82.7	73.5	87.2	87.3

Table 3.

Performance comparison on TT100K dataset.

Model	P (%)	R (%)	F1 (%)	FPS	mAP (%)	Forbidden (%)	Indicating (%)	Warning (%)
Faster-RCNN	56.9	77.2	65.2	26	73.4	79.6	65.8	74.8
SSD	83.3	48.4	61.2	142	64.6	67.6	60.1	66.2
YOLOv3	85.9	84.4	85.1	72	85.9	91.8	85.4	80.4
YOLOv5s	84.0	87.6	85.7	87	86.9	93.1	90.6	77.0
YOLOv7-tiny	74.5	71.7	73.1	60	70.2	80.9	76.2	53.4
Ours	86.6	91.3	88.9	68	91.6	94.2	95.7	85.1

In Fig. 5a, the mAP curves for the aforementioned models on the CCTSDB2021 dataset are presented. The analysis reveals that the proposed model outperforms the original YOLOv5s model in terms of precision (P), recall (R), and mean average precision (mAP), with improvements of 1.1%, 4.7%, and 3.9%, respectively.
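The F1 column in Table 2 and the gains quoted here can be checked with a few lines of arithmetic. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the values are copied from Table 2, and the gains are differences in percentage points.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) in percent, copied from Table 2.
TABLE2 = {"YOLOv5s": (88.3, 71.7), "Ours": (89.4, 76.4)}

for model, (p, r) in TABLE2.items():
    print(model, "F1 =", round(f1_score(p, r), 1))

base_p, base_r = TABLE2["YOLOv5s"]
ours_p, ours_r = TABLE2["Ours"]
p_gain = round(ours_p - base_p, 1)  # precision gain in percentage points
r_gain = round(ours_r - base_r, 1)  # recall gain in percentage points
print("P gain:", p_gain, "R gain:", r_gain)
```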

Similarly, in Fig. 5b, the mAP curves for the aforementioned models on the TT100K dataset are depicted. Once again, the refined model proposed herein demonstrates superiority over the original YOLOv5s model in terms of precision, recall, and mAP, with enhancements of 2.6%, 3.7%, and 4.4%, respectively. These advances highlight the efficacy of the proposed algorithmic enhancements. Furthermore, comparative analysis against alternative detection methods highlights the significant advantages conferred by our improvements, thereby validating the effectiveness of the proposed approach. As illustrated in Fig. 6, the proposed algorithm outperforms the YOLOv5s model in traffic sign detection. Notably, the enhanced model exhibits significantly improved detection capabilities, particularly for small-scale signs. Unlike the YOLOv5s model, which struggles with such objects, the enhanced model consistently identifies these targets accurately. Additionally, the improved model shows higher confidence scores in its detections, further supporting the efficacy of the proposed algorithm.

Fig. 6 — Comparison of the visualization of traffic sign detection in scenes on CCTSDB2021 and TT100K: (a) Original image (b) Results of YOLOv5s (c) Our Results.

Ablation experiment

To conduct a comprehensive analysis of the efficacy of the proposed algorithm and determine the distinct contribution of each module to overall model performance, ablation experiments are conducted using the CCTSDB2021 traffic sign dataset, with YOLOv5s serving as the baseline for evaluation. Table 4 shows the specific results.

Table 4.

Ablation results.

	P (%)	R (%)	F1 (%)	mAP (%)
YOLOv5s	88.3	71.7	79.1	78.8
YOLOv5s + WIoU(A)	89.2	72.7	80.1	79.8
YOLOv5s + WIoU + additional layer(B)	89.3	73.5	80.6	81.2
YOLOv5s + WIoU + additional layer + ML_SAP(C)	89.4	76.4	82.4	82.7

The term “WIoU” denotes the integrated loss function that incorporates a gradient gain distribution strategy, while “additional layer” refers to the inclusion of a small target detection layer. Additionally, “ML_SAP” represents the feature fusion mechanism developed in this study.

In Experiment A, only the loss function of the original model is replaced with WIoU. A comparative evaluation against YOLOv5s reveals a 1% enhancement in F1 and a 1% improvement in mAP. This outcome underscores the effectiveness of the gradient gain assignment strategy in reducing the dominance of high-quality anchor boxes while mitigating adverse gradients associated with low-quality examples.

Experiment B builds on the findings of Experiment A and introduces an additional small target detection layer. This enhancement demonstrates that fine multiscale feature fusion within the neck structure expands the model’s receptive field, facilitating improved detection of small targets. Experiment C confirms that the ML_SAP mechanism developed in this study effectively increases model accuracy and enhances small target detection capabilities. The results substantiate that each introduced innovation contributes significantly to improving model performance. Figure 7 visually illustrates the superiority of our algorithm, further corroborating the effectiveness of the proposed algorithm.
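The contribution of each ablation step can be read off Table 4 as the difference between consecutive rows. The short sketch below performs that subtraction; the row values are copied from Table 4, while the function name and the rounding to one decimal place are illustrative choices.

```python
# Rows of Table 4: (configuration, P, R, F1, mAP), all in percent.
ABLATION = [
    ("YOLOv5s", 88.3, 71.7, 79.1, 78.8),
    ("A: + WIoU", 89.2, 72.7, 80.1, 79.8),
    ("B: + additional layer", 89.3, 73.5, 80.6, 81.2),
    ("C: + ML_SAP", 89.4, 76.4, 82.4, 82.7),
]


def stepwise_gains(rows):
    """Gain of each row over the previous one, in percentage points."""
    gains = []
    for prev, cur in zip(rows, rows[1:]):
        deltas = tuple(round(c - p, 1) for p, c in zip(prev[1:], cur[1:]))
        gains.append((cur[0],) + deltas)
    return gains


for name, d_p, d_r, d_f1, d_map in stepwise_gains(ABLATION):
    print(f"{name}: P {d_p:+.1f}, R {d_r:+.1f}, F1 {d_f1:+.1f}, mAP {d_map:+.1f}")
```

Read this way, the largest single recall gain in Table 4 comes from the ML_SAP step.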

Fig. 7 — Comparison of different structures in YOLOv5s: (a) mAP 0.5 curve (b) P-R curves.

Comparison with other similar models

The feature pyramid network (FPN) and path aggregation network (PAN) constitute integral components of the YOLOv5s detection model, facilitating multiscale fusion and hierarchical target detection. In this study, we assess the impacts of adaptive spatial feature fusion (ASFF)²¹, bidirectional feature pyramid network (BIFPN)²², and multilevel squeeze feature perception (ML_SAP) on model detection performance with the CCTSDB2021 dataset. These three methods are compared with the baseline FPN + PAN architecture in the original YOLOv5s model.

Figure 8 shows the impact of different structures on the performance of the YOLOv5s model. Our analysis reveals that the ML_SAP mechanism, proposed in this study, outperforms the other methods in terms of precision‒recall (P‒R) curves and the mean average precision (mAP) index. Specifically, the ASFF structure achieves a 3.4% improvement in recall compared to the FPN + PAN structure. This enhancement can be attributed to the use of 1 × 1 convolutions and feature fusion techniques within the ASFF framework, which better capture the contributions of various feature scales to the predicted feature map, thereby augmenting the network’s predictive capabilities. Additionally, the BIFPN structure results in a 2.9% increase in precision compared to the FPN + PAN configuration. This improvement stems from the introduction of bidirectional connections between adjacent levels of the feature pyramid, facilitating the integration of information across different pyramid levels and thereby enhancing model prediction accuracy. Furthermore, compared with the FPN + PAN structure, the ML_SAP configuration yields improvements in both precision and recall of 2.9% and 5.5%, respectively. Notably, the ML_SAP structure not only enables the fusion of features across different levels but also facilitates additional feature extraction along the channel direction, thereby enhancing the model’s versatility and accuracy.

Fig. 8 — Comparison of similar structures in YOLOv5s: (a) mAP 0.5 curve (b) P-R curves.

Conclusion

In this study, an enhanced model built upon the YOLOv5s model is introduced for small traffic sign detection. Through modifications to the loss function and the incorporation of a dedicated small target detection layer, the network’s ability to recognize features associated with small targets is improved. The multilevel squeeze feature perception (ML_SAP) mechanism is designed to prioritize the channel dimension of feature information through the fusion of adjacent features. A series of ablation experiments and comparative analyses demonstrate that the proposed model outperforms the original framework, particularly in terms of precision and recall metrics. Furthermore, the feature fusion mechanism developed in this study has notable advantages over alternative fusion methodologies. Nevertheless, it is imperative for the proposed model to address considerations pertinent to field deployment, including optimization of computational resources and improvement of operational speed.

Author contributions

ZL and ZZ wrote the main manuscript, WX and GL drew the figures and tables, JC added content, and all authors reviewed the manuscript.

Data availability

The public datasets CCTSDB2021 and TT100K used in this study can be found at https://github.com/csust7zhangjm/CCTSDB2021 and https://cg.cs.tsinghua.edu.cn/traffic-sign/tutorial.html, respectively. The source code for the algorithms used in this study is available in the GitHub repository https://github.com/ultralytics/yolov5.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Lillo-Castellano, J. M., Mora-Jiménez, I., Figuera-Pozuelo, C. & Rojo-Álvarez, J. L. Traffic sign segmentation and classification using statistical learning methods. Neurocomputing 153, 286–299 (2015).
2. Liu, K. et al. FISS GAN: A Generative Adversarial Network for Foggy Image Semantic Segmentation. IEEE/CAA J. Autom. Sinica 8, 1428–1439 (2021).
3. Ruta, A., Li, Y. & Liu, X. Real-time traffic sign recognition from video by class-specific discriminative features. Pattern Recognition 43, 416–430 (2010).
4. Fleyeh, H., Biswas, R. & Davami, E. Traffic sign detection based on AdaBoost color segmentation and SVM classification. in Eurocon 2013 2005–2010 (IEEE, Zagreb, Croatia, 2013). 10.1109/EUROCON.2013.6625255.
5. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. Preprint at 10.48550/arXiv.1311.2524 (2014).
6. Girshick, R. Fast R-CNN. Preprint at 10.48550/arXiv.1504.08083 (2015).
7. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Preprint at 10.48550/arXiv.1506.01497 (2016).
8. Liu, W. et al. SSD: Single Shot MultiBox Detector. in vol. 9905 21–37 (2016).
9. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal Loss for Dense Object Detection. Preprint at 10.48550/arXiv.1708.02002 (2018).
10. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. Preprint at 10.48550/arXiv.1506.02640 (2016).
11. Rosas-Arias, L. et al. FASSD-Net: Fast and Accurate Real-Time Semantic Segmentation for Embedded Systems. IEEE Trans. Intell. Transport. Syst. 23, 14349–14360 (2022).
12. Li, S., Wang, S. & Wang, P. A Small Object Detection Algorithm for Traffic Signs Based on Improved YOLOv7. Sensors 23, 7145 (2023).
13. Wang, J. et al. Vehicle-Mounted Adaptive Traffic Sign Detector for Small-Sized Signs in Multiple Working Conditions. IEEE Trans. Intell. Transp. Syst. (2023) 10.1109/TITS.2023.3309644.
14. Bi, Z., Xu, F., Shan, M. & Yu, L. YOLO-RFB: An Improved Traffic Sign Detection Model. in Mobile Computing, Applications, and Services (eds. Deng, S., Zomaya, A. & Li, N.) vol. 434 3–18 (Springer International Publishing, Cham, 2022).
15. Han, Y., Wang, F., Wang, W., Li, X. & Zhang, J. YOLO-SG: Small traffic signs detection method in complex scene. J Supercomput (2023) 10.1007/s11227-023-05547-y.
16. Tong, Z., Chen, Y., Xu, Z. & Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. Preprint at 10.48550/arXiv.2301.10051 (2023).
17. Zhang, J. et al. CCTSDB 2021: A More Comprehensive Traffic Sign Detection Benchmark. Human-centric Computing and Information Sciences 12, 289–306 (2022).
18. Zhu, Z. et al. Traffic-Sign Detection and Classification in the Wild. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2110–2118 (IEEE, Las Vegas, NV, USA, 2016). 10.1109/CVPR.2016.232.
19. Redmon, J. & Farhadi, A. YOLOv3: An Incremental Improvement. Preprint at 10.48550/arXiv.1804.02767 (2018).
20. Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Preprint at 10.48550/arXiv.2207.02696 (2022).
21. Liu, S., Huang, D. & Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. Preprint at 10.48550/arXiv.1911.09516 (2019).
22. Tan, M., Pang, R. & Le, Q. V. EfficientDet: Scalable and Efficient Object Detection. Preprint at 10.48550/arXiv.1911.09070 (2020).
23. Belaroussi, R. & Tarel, J.-P. Angle vertex and bisector geometric model for triangular road sign detection. in 2009 Workshop on Applications of Computer Vision (WACV) 1–7 (IEEE, Snowbird, UT, USA, 2009). 10.1109/WACV.2009.5403030.
24. Song, L., Liu, Z., Duan, H. & Liu, N. A Color-Based Image Segmentation Approach for Traffic Scene Understanding. in 2017 13th International Conference on Semantics, Knowledge and Grids (SKG) 33–37 (IEEE, Beijing, China, 2017). 10.1109/SKG.2017.00014.
25. Li, Y., Yao, T., Pan, Y. & Mei, T. Contextual Transformer Networks for Visual Recognition. Preprint at http://arxiv.org/abs/2107.12292 (2021).
26. Li, Y., Li, J. & Meng, P. Attention-YOLOV4: a real-time and high-accurate traffic sign detection algorithm. Multimed Tools Appl 82, 7567–7582 (2023).
27. Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. CBAM: Convolutional Block Attention Module. Preprint at http://arxiv.org/abs/1807.06521 (2018).
28. Hou, Q., Zhou, D. & Feng, J. Coordinate Attention for Efficient Mobile Network Design. Preprint at 10.48550/arXiv.2103.02907 (2021).
29. Rezatofighi, H. et al. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. Preprint at https://doi.org/10.48550/arXiv.1902.09630 (2019).
30. Zheng, Z. et al. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. Preprint at http://arxiv.org/abs/1911.08287 (2019).
31. Zhang, H., Chang, H., Ma, B., Wang, N. & Chen, X. Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training. Preprint at https://doi.org/10.48550/arXiv.2004.06002 (2020).
32. Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal Speed and Accuracy of Object Detection. Preprint at 10.48550/arXiv.2004.10934 (2020).
33. Liu, K. et al. SynerFill: A Synergistic RGB-D Image Inpainting Network via Fast Fourier Convolutions. IEEE Trans. Intell. Veh. 9, 69–78 (2024).
