Abstract
In modern industrial construction sites, safety inspection tasks face challenges such as large-scale variations and complex multi-object detection scenarios, particularly in detecting critical safety indicators like flames, smoke, personnel attire, and operational behaviors. To address these limitations, this paper proposes YOLO-GL, an enhanced detection network featuring three key innovations: (1) a redesigned Parallelized Local-Global Multi-Level Fusion Module (C2f_gl) with a local-global attention mechanism for improved multi-scale feature representation; (2) a hierarchical feature fusion architecture; and (3) an adaptive feature fusion shuffling module (Multi-Scale Fusion Module and Adaptive Channel Shuffling Module, MSF&ACS) for dynamic optimization of cross-scale semantic relationships. Extensive experiments on composite datasets combining public benchmarks and industrial site collections demonstrate that YOLO-GL achieves state-of-the-art performance, improving mAP@0.5 by 3.5% (from 70.8% to 74.3%) and mAP@0.5:0.95 by 2.9% (from 38.7% to 41.6%) for flame detection, while maintaining real-time processing at 80.59 FPS. The proposed architecture exhibits superior robustness in complex environments, offering an effective solution for industrial safety monitoring applications.
Subject terms: Information technology, Computer science
Introduction
Industrial safety monitoring systems require precise detection of potential hazards, such as pre-ignition smoke signatures and personnel collapse incidents during high-risk operations. Recent frequent occurrences of fires and workplace accidents in factories have led to significant economic losses and severe threats to worker safety and corporate social responsibility. Current safety protocols rely heavily on manual supervision, which suffers from high latency, excessive labor costs, and limited monitoring coverage. These limitations highlight the urgent need for AI-powered detection frameworks with enhanced robustness and contextual awareness, particularly for complex industrial environments with dynamic illumination and occluded visual fields1.
Existing safety monitoring technologies can be categorized into three types: sensor-based, video-based, and deep learning-based detection. Sensor-based methods monitor environmental parameters like temperature and gas, while video-based approaches use cameras and computer vision algorithms to detect anomalies. Deep learning techniques leverage neural networks to extract high-level features, improving detection accuracy and robustness. However, each method has limitations: sensor-based systems are prone to environmental interference, video-based approaches struggle with false positives and negatives in complex scenes, and deep learning models demand substantial computational resources, hindering deployment in resource-constrained environments.
Object detection, a crucial subfield of computer vision, predicts target categories and locations in images with minimal latency, enabling applications such as autonomous driving and drone surveillance. Recent advancements in YOLO-series detectors2–10 have achieved remarkable efficiency-accuracy trade-offs through innovations including efficient feature extraction modules such as Cross Stage Partial Network (CSPNet)11 and EfficientRep12, as well as multi-scale fusion strategies such as Path Aggregation Network (PAN)13 and BiFPN14. Nevertheless, these methods face challenges in handling complex scenes, fine-grained targets, or multi-task scenarios.
Existing multi-scale feature fusion approaches, such as PAN and BiFPN, primarily focus on bidirectional feature aggregation. PAN employs a bottom-up and top-down pathway to enhance feature representation, while BiFPN introduces weighted connections to balance the contributions of different feature levels. However, these methods exhibit limitations in handling complex scenes, fine-grained targets, or multi-task scenarios due to their rigid fusion mechanisms and insufficient exploitation of hierarchical feature interactions.
In contrast, our proposed hierarchical feature fusion architecture incorporates triple-level feature interactions through the MSF&ACS module, which dynamically adjusts feature contributions based on channel-wise similarity and attention mechanisms. Specifically, MSF&ACS leverages attention-based channel shuffling to achieve more flexible and adaptive feature integration across different scales. This approach not only enhances the discriminative power of feature representations but also improves the model’s robustness in diverse detection scenarios, particularly excelling in challenging environments with fine-grained targets or complex backgrounds.
The C2f module in YOLOv8, while efficient, has limited capacity for deep modeling of local-global features, hindering accuracy-speed balance in cluttered backgrounds or dense target scenarios. Additionally, existing fusion methods rely on fixed concatenation operations, failing to adaptively adjust fusion strategies based on semantic correlations across multi-scale features. Redundant computational designs further constrain performance improvements.
To address these challenges, this paper proposes an enhanced detection network, YOLO-GL, with three key contributions:
Enhanced C2f_gl Module15: A redesigned feature extraction module incorporating a local-global attention mechanism to strengthen feature capture capabilities in scenarios with large target size variations and high target density.
Multi-Level Feature Fusion Network16: A hierarchical fusion architecture integrating triple-level feature fusion (shallow, middle, and deep) to enhance multi-scale detection capabilities in complex environments.
Multi-Scale Fusion Module and Adaptive Channel Shuffling Module (MSF&ACS)17: A dynamic fusion module enhancing semantic relevance and complementarity across multi-level features, improving detection accuracy and robustness.
Related work
Real-time object detection
Real-time object detection has garnered significant attention due to its critical role in applications requiring low-latency performance, such as autonomous driving, surveillance systems, and robotics. The YOLO-series models have emerged as dominant solutions in this domain, owing to their efficient architectures and superior performance. Early versions (YOLOv1–v3) established the foundational framework, comprising a backbone network, feature fusion neck, and detection head, laying the groundwork for subsequent improvements. YOLOv4 introduced CSPNet and an enhanced Path Aggregation Network (PAN), combined with advanced data augmentation techniques (Mosaic and CutMix) and multi-scale optimization strategies, significantly boosting detection accuracy and speed. However, it still faces limitations in handling complex backgrounds and fine-grained targets, particularly in detecting small and occluded objects. YOLOv6 further optimized the backbone and neck architecture through BiC and SimCSPSPPF modules, adopting self-distillation and anchor-assisted training strategies to enhance model performance. YOLOv7 proposed the E-ELAN module to improve gradient flow and explored cost-free enhancement techniques, achieving a better balance between efficiency and accuracy. Despite these advancements, these methods still encounter challenges in multi-task scenarios, necessitating more flexible feature fusion mechanisms. YOLOv8 integrated the C2f module for efficient feature extraction and fusion, setting a new benchmark in real-time detection. YOLO11 further refined the backbone and neck architecture by introducing the C3k2 and C2PSA components, optimizing the overall design and training process to improve both accuracy and processing speed. Nevertheless, existing methods still struggle with complex backgrounds and fine-grained targets, highlighting the need for more adaptive and robust feature fusion mechanisms. YOLOv1218 marks a pivotal advancement in the YOLO series by introducing a novel attention mechanism, A2C2f, which dynamically weights channels and spatial locations to enhance feature representation. This mechanism enables the model to better capture both local and global contextual information, addressing challenges in complex scenes and fine-grained target detection. By integrating attention mechanisms into the YOLO framework, YOLOv12 sets a new benchmark for real-time object detection, particularly in scenarios requiring robust feature fusion and multi-task adaptability.
End-to-end object detection
End-to-end object detection has emerged as a novel paradigm by eliminating handcrafted components and post-processing steps, thereby simplifying the detection pipeline. DETR pioneered this approach by leveraging a Transformer-based architecture and employing Hungarian matching for one-to-one prediction19. However, DETR suffers from slow convergence and poor performance in detecting small objects, limiting its practicality in real-time applications. Subsequent works have focused on addressing these limitations: Deformable-DETR accelerated convergence through deformable multi-scale attention, improving detection accuracy for small objects20; DINO integrated contrastive denoising and hybrid query selection to further enhance detection accuracy21; and RT-DETR balanced accuracy and latency by introducing a hybrid encoder and uncertainty-minimized query selection22. Inspired by these advancements, YOLOv10 adopted dynamic label assignment and redundancy-free prediction mechanisms, surpassing RT-DETR in both speed and accuracy. Despite these improvements, the computational overhead of Transformer-based architectures remains a challenge, particularly for high-resolution image processing, necessitating further innovations in efficiency optimization.
Feature fusion and optimization
Feature fusion strategies play a pivotal role in determining the performance of object detection models. Traditional methods like PAN and BiFPN employ fixed multi-scale fusion rules, often failing to adapt to dynamic scenes or diverse target scales. PAN enhances feature representation through bottom-up and top-down pathways, but its fusion mechanism lacks flexibility, making it difficult to handle complex backgrounds. BiFPN introduced weighted connections to balance contributions from different feature levels, but its high computational complexity limits its practicality in real-time applications. YOLOv8 adopted a concatenation-based fusion approach, which is computationally efficient but restricts the selective flow of information. Gold-YOLO addressed this limitation by introducing a Gather-and-Distribute (GD) mechanism, which globally aggregates multi-level features and injects contextual information into higher layers, thereby reducing information loss and improving detection accuracy across various scales. However, existing methods still struggle with fine-grained target detection and complex backgrounds, highlighting the need for more adaptive and context-aware fusion mechanisms. Our proposed approach aims to address these limitations by leveraging hierarchical feature interactions and dynamic attention mechanisms.
Localized attention mechanisms
Localized attention mechanisms have emerged as a key solution to address the high computational cost of global self-attention in vision transformers. Specifically, the Swin Transformer23 introduced local window attention with adaptive partitioning, significantly reducing computational complexity while maintaining high performance. Axial attention24 and criss-cross attention25 further optimized computational efficiency by operating along single dimensions such as rows or columns. The CSWin Transformer26 extended this approach by employing parallel horizontal and vertical stripe attention, achieving a better balance between efficiency and accuracy through simultaneous feature processing in both horizontal and vertical directions.
Although recent work27 has optimized attention mechanisms through local-global relationships, Mamba-based approaches28 remain suboptimal for real-time applications, primarily due to their high computational complexity and memory consumption. To address these limitations, we propose a streamlined local attention mechanism and seamlessly integrate it with the C2f module. This mechanism enhances computational efficiency and detection accuracy without complex architectural modifications by dynamically adjusting attention window sizes and selectively fusing features. Experimental results demonstrate that our approach excels in resource-constrained environments such as embedded devices, providing an efficient and practical solution for real-time object detection.
Methodology
In industrial application scenarios, target detection models are often required to operate on embedded devices or mobile applications, which typically have limited computational resources and memory, and demand high stability. YOLOv8, known for its superior performance and stability, is selected as the baseline model. As shown in Fig. 1, the improved YOLO-GL network architecture primarily consists of an input stage, a backbone network, a neck network, and a head network. The input stage handles image preprocessing, while the backbone network extracts features from the input. The neck network fuses multi-scale features and passes features of varying scales to the detection head for target detection. To overcome the limitations of current detectors in modeling local and global features and dynamically adapting fusion strategies to multi-scale semantic variations, we propose two modules:
Parallelized Local-Global Multi-Level Fusion Module (C2f_gl)
Multi-scale Adaptive Fusion Shuffling Module (MSF&ACS)
These modules complement each other while ensuring efficient use of computational resources.
Figure 1.
Overall network architecture diagram.
C2f_gl: local-global attention module
Traditional C2f modules are well known for their lightweight design and efficiency in feature extraction for object detection models. However, different spatial locations contribute unevenly to the model’s perception, and this location-specific information is worth preserving and enhancing. Despite their strengths, traditional C2f modules exhibit limitations in capturing global contextual information when processing complex backgrounds or dense object scenarios. To address this, we propose an enhanced C2f_gl module that integrates a Local Perception mechanism through parallel pathways to improve comprehensive feature representation, as illustrated in Fig. 2.
Figure 2.
Parallelized local-global multi-level fusion module.
Modular architecture
Given an input feature $X \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $B$, $C_{in}$, $H$, and $W$ denote batch size, input channels, height, and width, respectively, the module first applies a $1 \times 1$ convolutional layer to reduce the number of channels, thereby lowering the computational complexity. This operation yields intermediate features $X' \in \mathbb{R}^{B \times C_h \times H \times W}$, where $C_h = e \cdot C_{out}$ represents the compressed channel dimension ($C_{out}$: output channels; $e$: compression ratio). The intermediate features are split into two branches, $X' \rightarrow (X_1, X_2)$ with $X_1, X_2 \in \mathbb{R}^{B \times (C_h/2) \times H \times W}$. The $X_2$ branch is processed through stacked residual units to extract deeper features while preserving original semantic information, resulting in $\{X_2^{(1)}, \dots, X_2^{(n)}\}$.
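For clarity, a minimal PyTorch sketch of this compress-split-stack skeleton is given below (the attention branch is added in the next subsection). The class and argument names (`C2fGLSkeleton`, `n_bottlenecks`, `e`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual unit: two 3x3 convolutions with an identity shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.block(x)

class C2fGLSkeleton(nn.Module):
    """1x1 compression, split into two branches, stacked residual units on one branch."""
    def __init__(self, c_in, c_out, n_bottlenecks=2, e=0.5):
        super().__init__()
        hidden = int(c_out * e)                      # compressed channel dimension
        self.reduce = nn.Conv2d(c_in, 2 * hidden, 1, bias=False)
        self.blocks = nn.ModuleList(Bottleneck(hidden) for _ in range(n_bottlenecks))
        self.fuse = nn.Conv2d((2 + n_bottlenecks) * hidden, c_out, 1, bias=False)

    def forward(self, x):
        x1, x2 = self.reduce(x).chunk(2, dim=1)      # split into two branches
        feats = [x1, x2]
        for blk in self.blocks:                      # deepen one branch, keep every intermediate
            feats.append(blk(feats[-1]))
        return self.fuse(torch.cat(feats, dim=1))    # concatenate and compress

# x = torch.randn(1, 64, 80, 80); y = C2fGLSkeleton(64, 64)(x)   # -> (1, 64, 80, 80)
```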
Algorithm 1.
Local-Global Attention (LGA) Module
Local-global attention and feature fusion
- Local Feature Partitioning: The input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ is partitioned into non-overlapping local patches of size $k \times k$ using an unfolding operation. Spatial average pooling is then applied to each patch to extract local features:

$$L_i = \frac{1}{k^{2}} \sum_{(h,w) \in P_i} X(h, w), \quad i = 1, \dots, N, \qquad (1)$$

where

$$N = \frac{H}{k} \cdot \frac{W}{k} \qquad (2)$$

represents the total number of patches.

- Global Interaction for Mask Generation: A learnable global vector $p \in \mathbb{R}^{C}$ interacts with the local features $L_i$ via cosine similarity:

$$s_i = \frac{p \cdot L_i}{\lVert p \rVert \, \lVert L_i \rVert}. \qquad (3)$$

A nonlinear function $f(\cdot)$ modulates the similarity scores to generate a mask:

$$M_i = f(s_i), \qquad (4)$$

where $f(\cdot)$ is defined as

$$f(s_i) = \begin{cases} \alpha \, s_i, & s_i \geq t, \\ \beta \, s_i, & s_i < t, \end{cases} \qquad (5)$$

and $t$, $\alpha$, and $\beta$ denote the threshold, enhancement factor, and suppression factor, respectively. Although the global vector $p$ remains static during inference, the dynamic interaction between $p$ and the local features through cosine similarity enables the model to adapt to varying input scenes. This dynamic adjustment, combined with the model’s ability to learn $p$ during training, ensures that global semantics are effectively modeled even in complex scenarios such as fire scenes. The corresponding pseudocode implementation, as detailed in Algorithm 1 (Local-Global Attention (LGA) Module), describes the process of generating an attention mask through the interaction between local features and a global vector, and further applying it for feature enhancement and fusion.

- Feature Reorganization and Fusion: The attention mask is applied to the local features ($\tilde{L}_i = M_i \cdot L_i$), which are then concatenated with the multi-scale features from the residual branch. A $1 \times 1$ convolution compresses the fused features:

$$Y = \mathrm{Conv}_{1 \times 1}\!\big(\mathrm{Concat}(\tilde{L}, X_2^{(1)}, \dots, X_2^{(n)})\big). \qquad (6)$$
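A minimal PyTorch sketch of this mask-generation procedure follows, assuming a piecewise enhance/suppress form for $f(\cdot)$ with threshold `t`, enhancement factor `alpha`, and suppression factor `beta`; it illustrates Eqs. (1)-(6) rather than reproducing the authors' exact code, and the final upsampling step is an assumed way to restore the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGlobalAttention(nn.Module):
    def __init__(self, channels, patch=2, t=0.4, alpha=1.5, beta=0.5):
        super().__init__()
        self.patch, self.t, self.alpha, self.beta = patch, t, alpha, beta
        self.p = nn.Parameter(torch.randn(channels))            # learnable global vector p

    def forward(self, x):                         # x: (B, C, H, W), H and W divisible by patch
        B, C, H, W = x.shape
        k = self.patch
        # Eqs. (1)-(2): unfold into k x k patches and average-pool each patch -> local features
        patches = F.unfold(x, kernel_size=k, stride=k)          # (B, C*k*k, N)
        local = patches.view(B, C, k * k, -1).mean(dim=2)       # (B, C, N)
        # Eq. (3): cosine similarity between the global vector and each local feature
        sim = F.cosine_similarity(local, self.p.view(1, C, 1), dim=1)         # (B, N)
        # Eqs. (4)-(5): enhance similarities above the threshold, suppress the rest (assumed form)
        mask = torch.where(sim >= self.t, self.alpha * sim, self.beta * sim)
        # Eq. (6): apply the mask to the local features and map back to the spatial grid
        attended = (local * mask.unsqueeze(1)).view(B, C, H // k, W // k)
        return F.interpolate(attended, size=(H, W), mode="nearest")

# x = torch.randn(1, 32, 8, 8); y = LocalGlobalAttention(32)(x)  # -> (1, 32, 8, 8)
```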
Multi-scale adaptive fusion and adaptive channel shuffling module
To overcome the limitations of fixed concatenation in multi-scale feature fusion, we propose a dynamic fusion strategy that adaptively adjusts cross-scale feature weights. This module operates in two stages, as shown in Fig. 3. In the first stage, the Multi-Scale Fusion (MSF) module resizes and concatenates features from three different scales, effectively capturing both fine-grained and high-level semantic information, which is particularly beneficial for detecting objects of varying sizes. In the second stage, the Adaptive Channel Shuffling (ACS) module performs channel shuffling on the concatenated features guided by attention weights, enabling the model to dynamically focus on the most relevant features, thereby enhancing feature representation. Overall, this approach significantly improves the model’s detection performance in complex environments, particularly for targets that are blurred, partially occluded, or otherwise challenging to detect.
Figure 3.
Multi-scale fusion module and adaptive channel shuffling module (MSF&ACS).
Adaptive channel shuffling module
The Adaptive Channel Shuffling Module (ACS) dynamically rearranges channel groups of the input tensor through a learned attention mechanism, emphasizing critical features while suppressing irrelevant information. This design enhances feature representation by adaptively reweighting and reorganizing channels based on their semantic importance, which is particularly effective in complex scenarios where feature relevance varies significantly.
Given an input tensor $X \in \mathbb{R}^{B \times C \times H \times W}$, the ACS first partitions the channels into $G$ groups, each containing $C/G$ channels. This grouping strategy reduces computational complexity while preserving the structural integrity of the feature space, resulting in reshaped tensor dimensions:

$$X_g \in \mathbb{R}^{B \times G \times (C/G) \times H \times W}. \qquad (7)$$

The channel shuffling process is guided by sequential operations. The input tensor is first flattened along the spatial dimensions,

$$X_f = \mathrm{Flatten}(X) \in \mathbb{R}^{B \times C \times HW}, \qquad (8)$$

followed by normalization across the last dimension:

$$\hat{X}_f = \frac{X_f}{\lVert X_f \rVert_2}. \qquad (9)$$

Subsequent computation of inter-channel relationships generates a cosine similarity matrix:

$$S = \hat{X}_f \hat{X}_f^{\top} \in \mathbb{R}^{B \times C \times C}. \qquad (10)$$

To further optimize computational efficiency, group-wise averaging is applied to reduce the complexity of the similarity matrix:

$$S_g = \mathrm{GroupAvg}(S) \in \mathbb{R}^{B \times G \times G}. \qquad (11)$$

The processed similarities then undergo non-linear normalization, where the transformation incorporates the Sigmoid activation function $\sigma(\cdot)$. A lightweight Multi-Layer Perceptron (MLP) is then employed to refine the attention weights:

$$W = \mathrm{MLP}\big(\sigma(S_g)\big). \qquad (12)$$

This step enhances the discriminative power of the attention mechanism by capturing higher-order interactions among channel groups. Channel reordering is performed based on the refined attention weights, generating a shuffled tensor $\tilde{X}$. This process ensures that channels with higher importance scores are prioritized, while less relevant channels are suppressed. A residual connection $Y = \tilde{X} + X$ preserves the original features while enhancing gradient flow, producing the final output $Y$.
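The following PyTorch sketch walks through the ACS steps above (flattening, cosine similarity, group-wise averaging, Sigmoid, a lightweight MLP, importance-based reordering, and the residual connection). The class name, group count, and MLP width are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveChannelShuffle(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0, "channels must be divisible by the number of groups"
        self.g = groups
        # lightweight MLP refining the group-level attention weights (Eq. 12)
        self.mlp = nn.Sequential(nn.Linear(groups, groups), nn.ReLU(), nn.Linear(groups, groups))

    def forward(self, x):                                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        g, cg = self.g, C // self.g
        flat = x.flatten(2)                                      # (B, C, H*W)         Eq. (8)
        norm = F.normalize(flat, dim=-1)                         # spatial L2 norm      Eq. (9)
        sim = norm @ norm.transpose(1, 2)                        # (B, C, C) cosine     Eq. (10)
        sim_g = sim.view(B, g, cg, g, cg).mean(dim=(2, 4))       # (B, G, G) grouped    Eq. (11)
        weights = self.mlp(torch.sigmoid(sim_g.mean(dim=-1)))    # (B, G) refined       Eq. (12)
        order = weights.argsort(dim=1, descending=True)          # rank groups by importance
        grouped = x.view(B, g, cg, H, W)                         # channel groups       Eq. (7)
        idx = order.view(B, g, 1, 1, 1).expand(-1, -1, cg, H, W)
        shuffled = torch.gather(grouped, 1, idx).reshape(B, C, H, W)
        return shuffled + x                                      # residual connection

# x = torch.randn(2, 16, 20, 20); y = AdaptiveChannelShuffle(16, groups=4)(x)  # -> (2, 16, 20, 20)
```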
Multi-scale fusion module
The MSF module realizes the dynamic fusion of multi-resolution features before the ACS enhancement operation. Three input feature maps $\{A, B, C\}$ with different resolutions are aligned to match the spatial dimensions of $B$. For feature maps with a higher resolution than $B$, adaptive average pooling is applied:

$$A' = \mathrm{AdaptiveAvgPool}\big(A, (H_B, W_B)\big). \qquad (13)$$

For feature maps with a lower resolution than $B$, bilinear interpolation is used:

$$C' = \mathrm{Bilinear}\big(C, (H_B, W_B)\big). \qquad (14)$$

The aligned features are concatenated along the channel dimension:

$$F = \mathrm{Concat}(A', B, C'). \qquad (15)$$

This concatenated output is processed through the ACS module before channel compression via a $1 \times 1$ convolution to achieve the target dimensionality.
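A brief sketch of the alignment-and-fusion steps in Eqs. (13)-(15) is shown below, assuming three inputs A (higher resolution than B), B (the reference), and C (lower resolution than B); the compression width is an illustrative choice, and the ACS stage that YOLO-GL inserts between concatenation and compression is noted but omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    def __init__(self, c_a, c_b, c_c, c_out):
        super().__init__()
        # 1x1 convolution compressing the concatenated channels to the target width
        self.compress = nn.Conv2d(c_a + c_b + c_c, c_out, kernel_size=1, bias=False)

    def forward(self, a, b, c):
        h, w = b.shape[2:]
        a_down = F.adaptive_avg_pool2d(a, (h, w))                                    # Eq. (13)
        c_up = F.interpolate(c, size=(h, w), mode="bilinear", align_corners=False)   # Eq. (14)
        fused = torch.cat([a_down, b, c_up], dim=1)                                  # Eq. (15)
        # in YOLO-GL the concatenated tensor passes through ACS before this compression
        return self.compress(fused)

# a, b, c = torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)
# y = MultiScaleFusion(64, 128, 256, 128)(a, b, c)   # -> (1, 128, 40, 40)
```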
Mapping logic in MSF&ACS
The MSF&ACS module dynamically adjusts the fusion strategy based on the semantic relevance among shallow, middle, and deep features. Specifically, this module employs an adaptive channel selection mechanism to prioritize the most informative features at each hierarchical level. This mechanism ensures that the fusion process is optimized for the unique characteristics of each feature level, thereby enhancing the overall representational capacity of the network.
Shallow features are processed through the MSF module to preserve fine-grained details and enhance local feature representation. Middle features are aligned and fused by integrating shallow features, deep features, and the SPPF module, achieving a balance between local and global information. This integration improves the detection capability for medium-sized objects. Deep features are processed through the MSF module to leverage high-level semantic information, thereby enhancing global contextual understanding and the detection of large objects.
In contrast, FPN propagates high-level semantic features to lower levels via a top-down pathway, but it lacks the ability to dynamically adjust feature weights based on semantic relevance. Similarly, PANet enhances feature fusion by incorporating a bottom-up pathway, but it still relies on static fusion rules.
By clearly defining the mapping logic for shallow, middle, and deep features, YOLO-GL ensures a comprehensive and adaptive feature fusion process. This approach maximizes detection performance across diverse scenarios, addressing the challenges posed by varying object scales and complex backgrounds. The structured interaction among the three feature levels enables the model to achieve superior robustness and accuracy in object detection tasks.
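To make this routing concrete, the sketch below shows one way the three backbone levels could each produce a fused tensor for its detection scale; the channel widths, the level names (P3/P4/P5), and the alignment helper are illustrative assumptions consistent with the description above, and each fused tensor would still pass through ACS and a 1x1 compression as in the earlier sketches.

```python
import torch
import torch.nn.functional as F

def align_to(x, ref):
    """Resize x to the spatial size of ref: average-pool when shrinking, bilinear when enlarging."""
    h, w = ref.shape[2:]
    if x.shape[2] >= h:
        return F.adaptive_avg_pool2d(x, (h, w))
    return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)

# Backbone outputs: shallow (P3), middle (P4), deep (P5, after SPPF); widths are illustrative.
p3, p4, p5 = torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)

fused_p3 = torch.cat([p3, align_to(p4, p3), align_to(p5, p3)], dim=1)  # fine detail, small objects
fused_p4 = torch.cat([align_to(p3, p4), p4, align_to(p5, p4)], dim=1)  # local/global balance
fused_p5 = torch.cat([align_to(p3, p5), align_to(p4, p5), p5], dim=1)  # global context, large objects
```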
Experimental setup
The YOLOv8 architecture was selected as the baseline model due to its empirically validated balance between detection accuracy and computational efficiency in real-time applications. To enhance feature representation capabilities, we implemented our proposed local-global attention branch and multi-scale fusion mechanism, with comprehensive ablation studies conducted under identical training-from-scratch protocols. Experiments were executed in a CUDA-accelerated environment comprising Python 3.8.19, PyTorch 2.11, and CUDA 12.0, supported by an Intel® Core i7-14700KF CPU, an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM), and 64 GB DDR4 RAM. All experiments strictly maintained consistent hyperparameter settings, as detailed in Table 1.
Table 1.
Hyperparameter configurations.
| Hyperparameter | Range | Value |
|---|---|---|
| Learning_rate | {0.1,0.01,0.001} | 0.01 |
| Momentum | (0,1) | 0.937 |
| Weight_decay | (0,1) | 0.0005 |
| Warmup_epoch | {1,2,3,4} | 3 |
| Batchsize | {1,2,4,8,16} | 8 |
| Epoch | {100,200,300} | 300 |
| Imagesize | {320,640,1280} | 640 |
| Close_mosaic | {0,5,10,15} | 10 |
| Workers | {0,1,2,3,4,5,6,7,8} | 4 |
| Optimizer | {SGD,Adam,AdamW} | AdamW |
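For reference, the Table 1 settings map onto a training call roughly as sketched below, assuming the standard Ultralytics training interface; the dataset YAML path is a hypothetical placeholder.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")        # build the baseline from its config, i.e. training from scratch
model.train(
    data="industrial_safety.yaml",  # hypothetical dataset definition file
    epochs=300,
    imgsz=640,
    batch=8,
    optimizer="AdamW",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3,
    close_mosaic=10,
    workers=4,
)
```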
Data set construction
To address the lack of suitable public datasets for industrial safety monitoring, we developed a dedicated dataset by integrating public sources (flame and helmet datasets) with self-collected images captured under varying lighting conditions (natural/low-light) and perspectives (horizontal/elevation/depression) in real industrial environments. The initial 13,068-image collection (5,802 flames, 5,546 hot work, 1,720 workwear) was further enriched by incorporating additional multi-scenario flame images to improve flame detection. The dataset underwent rigorous augmentation, including random rotation, HSV saturation adjustment, and noise injection, followed by manual curation to ensure quality. It was then divided into training, testing, and validation sets in an 8:1:1 ratio using stratified sampling based on scene characteristics such as lighting conditions, viewpoints, and target density, ensuring a balanced representation of heterogeneous scenarios. All images were annotated with bounding boxes and class labels using LabelImg, offering comprehensive coverage of industrial safety scenarios (Fig. 4).
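A minimal sketch of the scene-stratified 8:1:1 split is given below, assuming each image carries a composite scene tag (lighting/viewpoint/density) used as the stratification key and that every tag occurs often enough to be split; the function and tag names are hypothetical.

```python
from sklearn.model_selection import train_test_split

def stratified_811_split(images, scene_tags, seed=0):
    """Split image paths 8:1:1 while keeping the scene-tag distribution balanced."""
    # hold out 20% first, stratified by scene tag
    train, rest, _, tag_rest = train_test_split(
        images, scene_tags, test_size=0.2, stratify=scene_tags, random_state=seed)
    # split the held-out 20% evenly into test and validation (10% each overall)
    test, val = train_test_split(rest, test_size=0.5, stratify=tag_rest, random_state=seed)
    return train, test, val
```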
Figure 4.
Partial display of the dataset.
To ensure comprehensive coverage of targets across varying scales and distances, the dataset was meticulously collected and filtered, encompassing a wide range of target sizes and diverse scene contexts. The specific proportional distribution of target sizes and their respective scale ranges within the dataset is illustrated in Fig. 5. This approach ensures a robust representation of industrial safety scenarios, enhancing the model’s adaptability to real-world applications.
Figure 5.
Target size distribution.
Evaluation metrics
To comprehensively evaluate the performance of the model, we employ several key metrics: Recall, mean Average Precision (mAP), the number of trainable parameters (Params), model size, and Frames Per Second (FPS). Recall measures the proportion of correctly predicted positive instances relative to the total number of actual positive instances, as shown in Eq. (16):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (16)$$
In industrial datasets, True Positive (TP) represents the number of targets correctly identified; False Positive (FP) indicates the number of targets incorrectly identified; and False Negative (FN) denotes the number of targets that were not detected.
For mAP (mean Average Precision), two primary metrics are used: mAP@0.5 and mAP@0.5:0.95. mAP@0.5 is the average precision calculated at an Intersection over Union (IoU) threshold of 0.5. When the IoU threshold ranges from 0.5 to 0.95, precision is computed at intervals of 0.05, and the mean of these precision values is defined as mAP@0.5:0.95. The corresponding formulas are as follows:
$$AP = \int_{0}^{1} P(R)\, dR \qquad (17)$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \qquad (18)$$
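As a small numerical illustration of Eqs. (16)-(18), the snippet below computes recall from TP/FN counts and averages per-threshold APs into mAP@0.5:0.95; the AP values are placeholders, not results from this paper.

```python
import numpy as np

def recall(tp, fn):
    """Eq. (16): fraction of ground-truth targets that were detected."""
    return tp / (tp + fn)

# Eq. (18): average the AP obtained at IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 1.00, 0.05)                    # ten thresholds
ap_per_threshold = np.array([0.74, 0.71, 0.68, 0.63, 0.57,
                             0.50, 0.42, 0.33, 0.22, 0.10])     # placeholder AP values
assert len(ap_per_threshold) == len(iou_thresholds)
map_50_95 = ap_per_threshold.mean()

print(recall(tp=704, fn=296))          # 0.704, i.e. a 70.4% recall
print(round(float(map_50_95), 3))      # mean of the placeholder APs
```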
Params represents the total number of trainable parameters in the model, reflecting its complexity. It is calculated as shown in Eq. (19):

$$\mathrm{Params} = O\!\left(\sum_{i} K_i^{2} \cdot C_{i-1} \cdot C_i \cdot M_i^{2}\right) \qquad (19)$$

where $O$ denotes the constant order, $K$ is the convolution kernel size, $C$ represents the number of channels, $M$ indicates the input image size, and $i$ denotes the number of iterations.
Additionally, FPS (Frames Per Second) measures the number of image frames processed in one second, providing a direct indicator of the operational speed of the image detection model.
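FPS can be estimated by timing repeated forward passes, as in the sketch below; the warm-up and CUDA synchronization steps matter for a fair measurement, and `model` is assumed to be an ordinary PyTorch module (or any callable wrapper around one).

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, imgsz=640, n_iters=200, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(20):                      # warm-up iterations (excluded from timing)
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)   # frames per second
```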
Comparison with state-of-the-art methods
As demonstrated in Table 2, our YOLO-GL model outperforms contemporary YOLO variants and classical detection networks, including Faster R-CNN29, Gold-YOLO, RT-DETR, YOLOv3, YOLOv5, YOLOv6, YOLOv8, YOLOv98, YOLOv10, YOLO11, YOLOv12, and FBRT-YOLO30. With a parameter count of 4.29 million, YOLO-GL achieves state-of-the-art mean Average Precision (mAP) scores of 74.3% (mAP@0.5) and 41.6% (mAP@0.5:0.95), surpassing all compared models while maintaining computational efficiency. Notably, the model attains a detection speed of 80.59 FPS, meeting real-time requirements, and a recall rate of 70.4%, which exceeds most of the compared detection networks. Despite its compact size, YOLO-GL also outperforms the larger S-scale models of successive YOLO generations. These results confirm the optimal balance of YOLO-GL between model compactness, efficiency, and detection accuracy.
Table 2.
Comparison with state-of-the-art models. With flame selected as the target, the model is compared with other YOLO variants and their derivatives in terms of mAP@0.5, mAP@0.5:0.95, parameter count, frame rate, and recall rate.
| Model | Param.(M) | mAP@0.5 | mAP@0.5:0.95 | Recall | FPS |
|---|---|---|---|---|---|
| FasterRCNN-resnet50 | 41.1 | 58.3 | 31.9 | 60.1 | 18.02 |
| YOLOv3n | 4.05 | 70.1 | 37.8 | 66.5 | 69.57 |
| YOLOv3s | 15.32 | 72.0 | 39.2 | 62.7 | 43.77 |
| YOLOv5n | 2.50 | 72.3 | 39.0 | 66.3 | 92.14 |
| YOLOv5s | 9.11 | 72.8 | 39.6 | 70.8 | 57.97 |
| YOLOv6n | 4.24 | 71.0 | 38.6 | 62.1 | 111.31 |
| YOLOv6s | 16.30 | 72.4 | 38.6 | 66.8 | 87.55 |
| YOLOv8n | 3.01 | 70.8 | 38.7 | 65.5 | 110.64 |
| YOLOv8s | 11.12 | 72.6 | 40.9 | 69.8 | 78.58 |
| YOLOv9s | 7.17 | 73.0 | 40.1 | 64.9 | 75.82 |
| YOLOv10n | 2.70 | 69.9 | 37.6 | 62.5 | 99.55 |
| YOLOv10s | 8.03 | 67.2 | 36.4 | 64.5 | 73.56 |
| YOLO11n | 2.96 | 73.1 | 39.6 | 67.6 | 122.11 |
| YOLO11s | 9.41 | 73.4 | 40.5 | 68.1 | 73.26 |
| YOLOv12n | 2.56 | 72.8 | 40.0 | 67.2 | 111.69 |
| YOLOv12s | 9.23 | 73.4 | 40.2 | 68.5 | 70.18 |
| FBRT-YOLO-N | 0.85 | 70.7 | 37.7 | 66.4 | 122.08 |
| FBRT-YOLO-S | 2.89 | 71.2 | 37.8 | 67.2 | 90.92 |
| Gold-YOLO | 8.04 | 72.3 | 39.8 | 66.1 | 71.52 |
| RT-DETR-n | 16.79 | 67.3 | 36.0 | 64.2 | 47.62 |
| RT-DETR-resnet50 | 41.93 | 68.9 | 37.5 | 63.9 | 31.59 |
| YOLO-GL(Ours) | 4.29 | 74.3 | 41.6 | 70.4 | 80.59 |
All results were obtained without employing advanced training techniques such as knowledge distillation or PGI to ensure a fair comparison.
Training and validating on diverse datasets plays a crucial role in improving the generalization capability, performance, and robustness of object detection models, ensuring their effectiveness in various real-world applications. In this study, we conducted a comparative experiment between YOLOv8n and YOLO-GL on the VisDrone 2019 dataset, a publicly available large-scale drone vision dataset31. The VisDrone 2019 dataset, released by Tianjin University, consists of 10,209 static images captured by drones from various perspectives. It includes 6,471 images for training, 548 for validation, and 1,610 for testing, encompassing approximately 2.6 million object instances. The dataset covers 10 object categories, such as pedestrians, cars, and trucks, across 14 urban and rural environments in China.
As demonstrated in Table 3, the YOLO-GL model exhibits enhanced detection performance, with improvements of 0.4% and 0.3% in mAP@0.5 and mAP@0.5:0.95, respectively, compared to YOLOv8n. Analyzing the precision performance across different categories, the YOLO-GL model demonstrates superior detection capabilities for targets such as pedestrians, people, bicycles, and cars. These results highlight the advantages of YOLO-GL in detecting multiple objects in large-scale, complex, and diverse environments. Furthermore, we conducted a comparative analysis of the detection results between YOLOv8n and YOLO-GL, as illustrated in Fig. 6. In the first row of image comparisons, our model successfully detected the person on the motorcycle, which was missed by YOLOv8n. In the second image, YOLO-GL accurately identified the seated individual, demonstrating its enhanced capability in detecting occluded or complex postures. In the third image, YOLO-GL exhibited superior performance in detecting the target object, further underscoring its robustness in challenging scenarios. These observations reinforce the improved detection accuracy and reliability of YOLO-GL in diverse and intricate environments.
Table 3.
Comparison of YOLOv8n and YOLO-GL on the VisDrone dataset.
| Class | YOLOv8n P | R | mAP@0.5 | mAP@0.5:0.95 | YOLO-GL P | R | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|---|---|---|
| All | 47.2 | 34.6 | 35.2 | 20.6 | 47.9 | 34.4 | 35.6 | 20.9 |
| Pedestrian | 47.4 | 36.6 | 37.8 | 16.5 | 44.4 | 37.8 | 37.8 | 16.5 |
| People | 51.3 | 25.1 | 29.3 | 10.7 | 56.6 | 23.0 | 30.0 | 11.1 |
| Bicycle | 29.6 | 9.63 | 9.54 | 3.94 | 29.7 | 9.87 | 9.92 | 4.05 |
| Car | 64.7 | 76.8 | 77.2 | 53.6 | 65.9 | 76.6 | 77.5 | 54.0 |
| Van | 52.2 | 38.5 | 40.8 | 28.2 | 52.0 | 40.6 | 41.3 | 28.9 |
| Truck | 47.3 | 30.9 | 30.9 | 19.7 | 48.5 | 30.8 | 31.6 | 20.2 |
| Tricycle | 38.0 | 27.0 | 23.5 | 12.9 | 38.8 | 27.0 | 24.2 | 13.5 |
| Awning-tricycle | 29.8 | 15.5 | 13.0 | 8.22 | 26.4 | 18.0 | 14.3 | 8.96 |
| Bus | 60.9 | 46.2 | 50.7 | 35.2 | 63.1 | 43.4 | 49.3 | 35.1 |
| Motor | 50.5 | 39.4 | 39.2 | 16.7 | 53.5 | 37.1 | 40.1 | 17.0 |
Figure 6.
The comparative detection results on the VisDrone 2019 Dataset are presented, where (a) represents the detection results of YOLOv8n and (b) represents the detection results of YOLO-GL.
As depicted in Table 4, we conducted a performance evaluation of the proposed C2f_gl module within the main network, comparing it with other variants: the C3 module in YOLOv5, the C2f module in YOLOv8, the C3k2 module in YOLO11, and the A2C2f module in YOLOv12. These five feature extraction modules were integrated into the YOLOv8 backbone for training. The experimental results indicate that our C2f_gl module achieved the best performance in terms of mAP@0.5, mAP@0.5:0.95, and Recall.
Table 4.
Precision comparison of the C2f module and its variants in the YOLOv8 baseline model.
| Module | Param. (M) | mAP@0.5 | mAP@0.5:0.95 | Recall | FPS |
|---|---|---|---|---|---|
| C2f | 3.01 | 70.8 | 38.7 | 65.5 | 110.64 |
| C3 | 2.68 | 71.9 | 39.3 | 64.5 | 124.26 |
| C3k2 | 2.53 | 73.0 | 39.7 | 66.2 | 130.55 |
| A2C2f | 3.11 | 73.1 | 39.4 | 68.2 | 107.64 |
| C2f_gl (Ours) | 3.43 | 73.1 | 40.1 | 68.9 | 101.32 |
The mAP@0.5 and mAP@0.5:0.95 for the other target objects were compared with the original network, as illustrated in Fig. 7. Specifically, YOLO-GL improved mAP@0.5 from 60.8% to 61.1% for smoke, 89.2% to 89.4% for polish, 92.7% to 94.5% for weld, 66.0% to 67.1% for fall, and 79.1% to 80.0% for clothes, while mAP@0.5:0.95 improved from 50.5% to 51.2% for smoke, 95.8% to 96.4% for polish, 53.8% to 58.3% for fall, and 67.4% to 68.9% for clothes. These results indicate that YOLO-GL consistently outperforms YOLOv8n across multiple categories, particularly in challenging scenarios such as fall and weld, where significant improvements were observed.
Figure 7.
Comparison of mAP@0.5 and recall rates for six target categories (smoke, normal, polish, weld, fall, and clothes) with the YOLOv8n model.
For multi-object detection in complex scenarios, Fig. 8 demonstrates the superior capability of YOLO-GL in detecting distant, small, and blurry targets compared to YOLOv6n, Gold-YOLO, and YOLOv8n. The enhanced performance is attributed to the integration of the Parallelized Local-Global Multi-Level Fusion module (C2f_gl) and the Multi-Scale Fusion Module and Adaptive Channel Shuffling Module (MSF&ACS), which significantly improve detection robustness across diverse scenarios involving multi-class, multi-scale objects against cluttered backgrounds. These modules enable YOLO-GL to effectively capture fine-grained details and global contextual information, ensuring accurate and reliable detection even in challenging environments.
Figure 8.
Comparative analysis of multi-object detection performance across different models: (a) original image, (b) YOLOv6n, (c) Gold-YOLO, (d) YOLOv8n, and (e) YOLO-GL. Rows 1 and 2 focus on small or distant targets, while rows 3 and 4 demonstrate clothing and multi-person scenario detections.
Model analysis
Ablation study
Table 5 presents the ablation study results, which systematically evaluate the individual contributions of each proposed component. When only the C2f_gl module is introduced, both the mAP and recall rates show notable improvements compared to the baseline YOLOv8 model. Similarly, when the MSF&ACS fusion network is implemented independently, significant enhancements in mAP and recall are observed relative to YOLOv8. Notably, the highest performance is achieved when both the C2f_gl module and the MSF&ACS fusion network are integrated, resulting in the maximum values for mAP and recall. These findings demonstrate that each component contributes meaningfully to the model’s overall performance, with the combined architecture delivering the most substantial gains in detection accuracy and robustness. This systematic analysis underscores the effectiveness of the proposed modules and their synergistic impact on the model’s capabilities.
Table 5.
Ablation study results.
| Model configuration | C2f_gl | MSF | ACS | mAP@0.5 | mAP@0.5:0.95 | Recall |
|---|---|---|---|---|---|---|
| YOLOv8n (baseline) | | | | 70.8 | 38.7 | 65.5 |
| YOLOv8n (C2f_gl) | ✓ | | | 73.1 | 40.1 | 68.9 |
| YOLOv8n (MSF + ACS) | | ✓ | ✓ | 72.4 | 41.1 | 67.4 |
| YOLO-GL (C2f_gl + MSF + ACS) | ✓ | ✓ | ✓ | 74.3 | 41.6 | 70.4 |
Converged network analysis
In the fusion network section, we conducted ablation studies on YOLOv6, YOLOv10, YOLO11, and YOLOv12. As shown in Table 6, the results demonstrate that our fusion network significantly outperforms the original networks in terms of mAP@0.5, mAP@0.5:0.95, and recall rate. Specifically, for YOLOv6, the mAP@0.5 increased from 71.0% to 71.7%, and the recall rate improved from 62.1% to 66.2%. For YOLOv10, the mAP@0.5 rose from 69.9% to 70.9%, and the recall rate increased from 62.5% to 65.5%. In the case of YOLO11, the mAP@0.5 improved from 73.1% to 73.4%, and the recall rate increased from 67.6% to 69.2%. Lastly, for YOLOv12, the mAP@0.5 remained stable at 72.8%, but the recall rate increased from 67.2% to 67.5%. These results highlight the effectiveness of our fusion network in enhancing the performance of YOLO variants across multiple metrics.
Table 6.
Ablation study on MSF&ACS modules in YOLO variants.
| Model | MSF&ACS | Param.(M) | mAP@0.5 | mAP@0.5:0.95 | Recall | FPS |
|---|---|---|---|---|---|---|
| YOLOv6 | | 4.24 | 71.0 | 38.6 | 62.1 | 111.31 |
| YOLOv6 | ✓ | 4.96 | 71.7 | 38.6 | 66.2 | 102.94 |
| YOLOv10 | | 2.70 | 69.9 | 37.6 | 62.5 | 99.55 |
| YOLOv10 | ✓ | 3.24 | 70.9 | 38.9 | 65.5 | 90.08 |
| YOLO11 | | 2.96 | 73.1 | 39.6 | 67.6 | 122.11 |
| YOLO11 | ✓ | 3.84 | 73.4 | 41.0 | 69.2 | 107.64 |
| YOLOv12 | | 2.56 | 72.8 | 40.0 | 67.2 | 111.69 |
| YOLOv12 | ✓ | 3.02 | 72.8 | 40.2 | 67.5 | 103.17 |
Parameter sensitivity analysis
Systematic experiments (Fig. 9) demonstrate that the C2f_gl module achieves its maximum accuracy (73.1% mAP@0.5) at the optimal setting of the threshold $t$, enhancement factor $\alpha$, and suppression factor $\beta$, highlighting the importance of parameter tuning for robust multi-object detection. Comparative feature visualization with YOLOv8 (Fig. 10) further reveals the proposed model’s superior perceptual capabilities in challenging tasks, where it effectively suppresses irrelevant regions while enhancing target features, outperforming the baseline in both robustness and precision.
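The sensitivity study amounts to a one-factor-at-a-time sweep over $t$, $\alpha$, and $\beta$, sketched below; `train_and_eval` is a hypothetical helper that trains the C2f_gl variant with a given triple and returns its mAP@0.5, and the candidate grids and fixed values (other than $t = 0.4$) are placeholders.

```python
# One-factor-at-a-time sweep mirroring Fig. 9; grids and fixed values are placeholders.
thresholds = [0.2, 0.3, 0.4, 0.5, 0.6]
alphas = [1.1, 1.25, 1.5, 1.75, 2.0]
betas = [0.25, 0.5, 0.75]

def sweep(train_and_eval, fixed_t=0.4, fixed_alpha=1.5, fixed_beta=0.5):
    results = {}
    for t in thresholds:
        results[("t", t)] = train_and_eval(t=t, alpha=fixed_alpha, beta=fixed_beta)
    for a in alphas:
        results[("alpha", a)] = train_and_eval(t=fixed_t, alpha=a, beta=fixed_beta)
    for b in betas:
        results[("beta", b)] = train_and_eval(t=fixed_t, alpha=fixed_alpha, beta=b)
    best = max(results, key=results.get)     # (factor, value) with the highest mAP@0.5
    return best, results
```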
Figure 9.
Parameter sensitivity analysis. From left to right: Threshold ($t$): impact of the threshold parameter $t$ under fixed $\alpha$ and $\beta$; Alpha ($\alpha$): impact of the balance coefficient $\alpha$ with constrained $t = 0.4$ and fixed $\beta$; Beta ($\beta$): impact of the scaling factor $\beta$ under $t = 0.4$ and fixed $\alpha$.
Figure 10.
Feature visualization comparison: (a) Original image (b) Activation maps from YOLOv8’s C2f module (c) Enhanced activations from YOLO-GL’s C2f_gl module.
Platform deployment
As demonstrated in Fig. 11, the YOLO-GL framework was successfully validated through field deployment using an integrated hardware system consisting of a Hikvision industrial spherical camera with environmental sensors and a portable computing unit (Intel® Core i7-14650HX CPU, NVIDIA RTX 4070 GPU with 8 GB GDDR6 VRAM, 16 GB DDR5 RAM) running Python 3.8.19 with PyTorch 2.11 and CUDA 12.0 acceleration. Despite hardware limitations, the system achieved stable real-time detection performance exceeding 30 FPS, meeting operational requirements for industrial applications while demonstrating the framework’s practical deployability in resource-constrained scenarios.
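A minimal real-time loop for such a deployment is sketched below, assuming an OpenCV-readable camera stream and the standard Ultralytics prediction interface; the stream URL and weight path are placeholders.

```python
import cv2
from ultralytics import YOLO

model = YOLO("yolo_gl.pt")                              # placeholder path to trained weights
cap = cv2.VideoCapture("rtsp://<camera-ip>/stream")     # placeholder industrial camera stream

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model.predict(frame, imgsz=640, conf=0.25, verbose=False)
    annotated = results[0].plot()                       # draw boxes and labels on the frame
    cv2.imshow("YOLO-GL safety monitor", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):               # press q to stop
        break

cap.release()
cv2.destroyAllWindows()
```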
Figure 11.
Equipment and field testing: The left panel (a) depicts the on-site testing of the equipment, while the right panel (b) presents a frontal view of the equipment.
Conclusion
This study improves industrial safety inspection accuracy and reduces false positives/negatives by optimizing the backbone and neck components of the YOLO framework. For feature extraction, we propose the Parallelized Local-Global Multi-Level Fusion module (C2f_gl), which significantly enhances detection precision for localized small targets. In feature fusion, a novel strategy incorporating the Multi-Scale Fusion and Adaptive Channel Shuffling modules (MSF&ACS) strengthens cross-layer semantic integration and optimizes channel arrangements. Experimental results demonstrate that YOLO-GL achieves competitive accuracy and recall rates. However, the current model exhibits suboptimal inference efficiency on low-power devices, necessitating further architectural refinement and quantization compression to balance computational efficiency and detection accuracy.
Acknowledgements
This work was supported in part by the Basic Research Project (Key Research Project) of the Education Department of Liaoning Province (No. JYTZD2023009), and in part by the Fundamental Research Funds for the Universities of Liaoning province (320224081).
Author contributions
X.C. and C.D. wrote the manuscript, organized the formulas, edited the figures and tables, and formatted the overall paper. T.C., G.L., and Y.R. reviewed the manuscript and provided critical suggestions. All authors reviewed and approved the final manuscript.
Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Ku, B., Kim, K. & Jeong, J. Real-time isr-yolov4 based small object detection for safe shop floor in smart factories. Electronics11, 2348 (2022). [Google Scholar]
- 2.Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 779–788 (2016).
- 3.Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
- 4.Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
- 5.Li, C. et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
- 6.Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7464–7475 (2023).
- 7.Reis, D., Kupec, J., Hong, J. & Daoudi, A. Real-time flying object detection with yolov8. arXiv preprint arXiv:2305.09972 (2023).
- 8.Wang, C.-Y., Yeh, I.-H. & Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In European conference on computer vision, 1–21 (Springer, 2024).
- 9.Wang, A. et al. Yolov10: Real-time end-to-end object detection. Adv. Neural. Inf. Process. Syst.37, 107984–108011 (2025). [Google Scholar]
- 10.Khanam, R. & Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725 (2024).
- 11.Wang, C.-Y. et al. Cspnet: A new backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 390–391 (2020).
- 12.Weng, K., Chu, X., Xu, X., Huang, J. & Wei, X. Efficientrep: An efficient repvgg-style convnets with hardware-aware neural network design. arXiv preprint arXiv:2302.00386 (2023).
- 13.Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8759–8768 (2018).
- 14.Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10781–10790 (2020).
- 15.Xu, S. et al. Hcf-net: Hierarchical context fusion network for infrared small object detection. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1–6 (IEEE, 2024).
- 16.Wang, C. et al. Gold-yolo: Efficient object detector via gather-and-distribute mechanism. Adv. Neural. Inf. Process. Syst.36, 51094–51112 (2023). [Google Scholar]
- 17.Zhang, X., Zhou, X., Lin, M. & Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6848–6856 (2018).
- 18.Tian, Y., Ye, Q. & Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv preprint arXiv:2502.12524 (2025).
- 19.Carion, N. et al. End-to-end object detection with transformers. In European conference on computer vision, 213–229 (Springer, 2020).
- 20.Zhu, X. et al. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020).
- 21.Zhang, H. et al. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605 (2022).
- 22.Zhao, Y. et al. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16965–16974 (2024).
- 23.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
- 24.Ho, J., Kalchbrenner, N., Weissenborn, D. & Salimans, T. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180 (2019).
- 25.Huang, Z. et al. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, 603–612 (2019).
- 26.Dong, X. et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12124–12134 (2022).
- 27.Chu, X. et al. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural. Inf. Process. Syst.34, 9355–9366 (2021). [Google Scholar]
- 28.Liu, Y. et al. Vmamba: Visual state space model. Adv. Neural. Inf. Process. Syst.37, 103031–103063 (2025). [Google Scholar]
- 29.Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst.28 (2015). [DOI] [PubMed]
- 30.Xiao, Y., Xu, T., Xin, Y. & Li, J. Fbrt-yolo: Faster and better for real-time aerial image detection. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39 (2025).
- 31.Du, D. et al. Visdrone-det2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF international conference on computer vision workshops, 0–0 (2019).