Scientific Reports. 2025 Dec 15;15:43769. doi: 10.1038/s41598-025-27403-0

A lightweight cross-scale EDS-DETR model for hazard detection in transmission corridors

He Su 1, Jiaomin Liu 1, Zhenzhou Wang 2, Pingping Yu 2
PMCID: PMC12705665  PMID: 41398415

Abstract

Visual inspection technology has been widely employed for identifying external hazards in transmission corridors. However, Convolutional Neural Networks (CNNs) exhibit limitations in multi-scale target detection within complex environments while struggling to balance accuracy and lightweight design. To address these issues, we propose a lightweight cross-scale detection model, called EDS-DETR. First, we improve the ResNet18 backbone by integrating the efficient multi-scale attention module with partial convolutions to improve computational efficiency and representational capability. Second, DySample is introduced into the encoder to efficiently restore the feature resolution of inspection images at minimal computational cost, while preserving feature details and enhancing dynamic perception capability. Finally, we adopt the ShapeIoU loss function to enhance the detection accuracy of external hazards and accelerate model convergence. Experiments on a self-built dataset show that the proposed EDS-DETR model achieves precision, recall, and mAP@0.5 scores of 91.4%, 85.1%, and 93.1%, respectively. Notably, our method exhibits significant advantages in model efficiency, requiring 13.4% fewer parameters and a 14.8% smaller model size than the baseline. Furthermore, EDS-DETR achieves an inference speed of 190 FPS, which satisfies real-time requirements. The experimental results prove the effectiveness and practicability of the EDS-DETR model, contributing to enhanced reliability and safety in power transmission.

Keywords: Hazard detection, Lightweight network, RT-DETR, Attention mechanism

Subject terms: Engineering, Electrical and electronic engineering

Introduction

In recent years, with rapid economic development and increasing demand for electricity, many countries have focused on constructing transmission lines1. However, as the coverage of transmission lines continues to expand, transmission corridors also face numerous safety hazards. These hazards mainly include illegal construction machinery entering hazardous zones or touching the transmission line due to non-standard high-altitude work, leading to damage to transmission equipment, power outages, and even accidental electrocutions2. Deep learning-based object detection has provided an efficient solution for identifying external hazards in transmission corridors. Mainstream deep learning-based object detection algorithms primarily employ Convolutional Neural Networks (CNNs), which can be categorized into one-stage and two-stage types3. Liu et al.4 proposed a deep learning-based method for detecting external damage to transmission lines. They trained an object detection model by annotating thousands of surveillance images, and their experimental results demonstrated that the method is robust to changes in illumination and weather. To address the challenges of data imbalance and small object detection, Qu et al.5 introduced an Enhanced Online Hard Example Mining (E-OHEM) algorithm. Integrated with the Faster R-CNN framework, this algorithm improved the mAP by 0.5% on the PASCAL VOC2007 dataset. Leng et al.6 developed an anti-external-damage system utilizing an improved R-FCN network, which accomplished real-time monitoring and intrusion warnings for transmission line environments.

In the aspects of model lightweighting and multi-scale feature fusion, Yu et al.7 proposed an improved CenterNet model by incorporating EfficientNet-B0 and an adaptive bidirectional feature pyramid network (AF-BiFPN), achieving a mean average precision (mAP) of 94.74% and increasing the detection speed to 29 FPS on a dataset of engineering vehicles. Zou et al.8 constructed a dataset for external hazard detection in transmission line corridors based on YOLO-LSDW, enhancing mAP@0.5 by 3.4% while reducing model parameters through the integration of large separable kernel attention (LSKA) and a dynamic head module. Yu et al.9 further proposed a lightweight CER-YOLOv5s algorithm by embedding Ghost bottleneck structures and an enhanced receptive field module (ERM), attaining an mAP of 98.3% with a 27.8% reduction in parameters under complex power line scenarios. Liu et al.10 introduced a content-aware approach that integrates content-aware upsampling and downsampling modules (CADM/CAUM) into the GELAN framework, achieving an mAP of 96.50% on a transmission line dataset and significantly improving multi-scale object localization accuracy.

In general, these studies collectively address the problem through single-stage and two-stage detection paradigms. Although single-stage detectors provide superior inference speed, two-stage methods exhibit higher accuracy11. However, CNN-based approaches in this domain still face two fundamental challenges. First, their inherent architectural biases (such as locality and translational invariance) constrain their capacity to capture global contextual dependencies, which are crucial for reliably identifying hazards that exhibit extreme scale variations, irregular shapes, and occlusion. This limitation frequently results in elevated false-negative and false-positive rates. Second, prevailing methods extensively depend on non-maximum suppression (NMS) for post-processing, which not only introduces additional computational complexity but can also incorrectly suppress valid detections. Moreover, in practical deployment scenarios, achieving an optimal balance between model lightweight design and high accuracy without compromising deployment efficiency remains a persistent challenge within existing architectural frameworks.

To simultaneously address the challenges of detection performance and deployment efficiency, we propose a novel lightweight detector—EDS-DETR, aiming to improve detection accuracy for multi-scale and irregularly shaped disaster targets while reducing computational cost. The main contributions of this paper are fourfold:

  1. Based on EMA12 and PConv13, we develop a lightweight backbone network named EPNet. This architecture significantly reduces computational redundancy and parameter count, and enhances multi-scale feature representation to optimize feature extraction in transmission corridor scenarios.

  2. We introduce a lightweight upsampling module, DySample14, into the encoder to replace the model's upsampling operation and improve the recognition of distant dynamic targets.

  3. We use the ShapeIoU loss function to achieve accurate bounding box regression. By incorporating the geometric attributes of the target bounding box into the loss computation, this approach specifically addresses the challenge of accurately localizing irregularly shaped obstacles and significantly improves detection accuracy.

  4. We carry out a large number of experiments on the HDTrans dataset. Comprehensive qualitative and quantitative evaluations demonstrate that EDS-DETR achieves state-of-the-art detection accuracy while maintaining a better balance between performance and efficiency compared to mainstream benchmarks, confirming its effectiveness for external damage detection in transmission lines.

Rather than proposing entirely novel architectural components, we focus on the systematic integration and domain-specific optimization of existing advanced methods. The core innovation of EDS-DETR lies in the careful selection, adaptive adjustment, and synergistic combination of existing components to address the unique challenges posed by hazard detection in transmission corridors.

Related works

In recent years, the application of Transformer15 architectures in real-time object detection has emerged as a significant research direction in the field of computer vision. Among them, the Detection Transformer (DETR) framework marks a paradigm shift by formulating object detection as a set prediction problem, thereby eliminating the need for manually designed components such as anchor boxes and NMS16. However, DETR’s performance is often limited, particularly in detecting small objects within complex backgrounds. This limitation primarily stems from its reliance on convolutional backbones that output low-resolution feature maps, resulting in significant loss of spatial details critical for identifying small targets. Moreover, the global attention mechanism applied to such feature maps results in high computational complexity, making multi-scale feature processing challenging.

To address these limitations, several enhanced variants have been proposed. Deformable-DETR introduces a deformable attention module that focuses on sparse key sampling points, improving convergence and multi-scale detection accuracy17. However, it still faces efficiency limitations. Swin Transformer employs a hierarchical architecture with shifted windows to model long-range dependencies effectively18. However, its computational overhead remains substantial for real-time applications. While these variants emphasize improvements in accuracy and convergence, achieving real-time inference speed remains a widespread challenge for general Transformer-based detectors.

RT-DETR is a real-time-oriented variant specifically designed to overcome efficiency bottlenecks19. Its key innovations lie in an efficient hybrid encoder that adeptly leverages multi-scale features and an uncertainty-minimization query selection mechanism that accelerates the decoding process. This design enables RT-DETR to effectively address the speed-accuracy trade-off prevalent in many CNN and DETR models, establishing it as the state-of-the-art benchmark for real-time detection tasks20.

Given its demonstrated balance between accuracy and speed, we adopt RT-DETR as our baseline. However, for our specific application of real-time hazardous object detection in complex transmission corridors, the computational and memory demands of its end-to-end architecture are non-negligible. Hazards in these scenarios exhibit extreme scale variations, irregular shapes, and frequent occlusions, necessitating high model precision. Consequently, the core challenge and focus of our work is how to adapt this powerful architecture to operate under strict resource constraints without compromising its detection performance in these demanding situations.

Methods and methodology

EDS-DETR

Transmission corridor hazard detection poses significant challenges due to the diverse morphologies of targets, their random spatial distribution, and frequent motion-induced image artifacts. To address these issues, we introduce EDS-DETR, a novel lightweight architecture that maintains high detection accuracy while substantially reducing computational complexity. As depicted in Fig. 1, our model systematically addresses these challenges through three core innovations: (a) a backbone network for efficient multi-scale feature extraction, (b) a hybrid encoder for detail-preserving feature fusion, and (c) a shape-aware loss function for precise bounding box optimization.

Fig. 1.

Fig. 1

Overall architecture of the Efficient Deformable Sampling DETR (EDS-DETR) framework.

EPNet: Integrating the Efficient Multi-scale Attention (EMA) mechanism with partial convolutions (PConv), we develop the EPNet backbone that significantly enhances multi-scale feature extraction while reducing model parameters.

Efficient Hybrid Encoder: The hybrid encoder, comprising Attention-based Intra-scale Feature Interaction (AIFI) and Cross-Scale Feature Fusion (CCFF), transforms multi-scale features into image feature sequences19. We incorporate the DySample upsampler into the CCFF module to streamline upsampling and improve the detection of distant dynamic targets.

ShapeIoU: Replacing conventional GIoU (Generalized Intersection over Union)21 with ShapeIoU, our approach improves detection accuracy by precisely evaluating bounding box shape and scale matching and by accelerating model convergence.

Lightweight backbone with efficient multi-scale attention

The conventional RT-DETR model typically utilizes a ResNet-based backbone. This architecture processes all input channels during convolution, resulting in substantial computational overhead and limitations in effectively capturing multi-scale features22. To achieve a superior balance between representational capacity and computational efficiency in the backbone network, we introduce several enhancements to the RT-DETR-R18 baseline, constructing a novel EMA-PConv network (EPNet). The overall architecture of EPNet is depicted in Fig. 2.

Fig. 2.

Fig. 2

Overall architecture of the Efficient Pyramid Network (EPNet).

In Fig. 2(a), the input image first passes through a stem module composed of three consecutive 3 × 3 convolutional layers. These layers progressively extract features ranging from low-level details to high-level abstractions. The features are then passed through a max-pooling layer to reduce their spatial dimensions and the computational burden for subsequent stages. The pooled features are input into the EPBlock module (see Algorithm A in the Supplementary Information), whose detailed structure is illustrated in Fig. 2(b).

Each EPBlock module consists of a PConv, two 1 × 1 convolutional layers, and an EMA attention module, connected via a residual connection. The input feature map X is first processed by the PConv operation, followed by two 1 × 1 convolutional layers for channel compression and feature fusion. Normalization and ReLU activation are applied after the first convolution. Finally, the EMA module is applied to the output of the second 1 × 1 convolution to perform multi-scale context aggregation, which is then added to the original input X via a residual connection to produce the output feature Y.

Figure 2(c) illustrates the architecture of PConv, which reduces computational cost by performing standard convolution only on a selected subset of input channels. The floating-point operations (FLOPs) are calculated as follows:

FLOPs_PConv = H × W × k² × c_p²    (1)

Here, H and W are the spatial dimensions of the feature map, k is the convolution kernel size, and c_p corresponds to the number of channels processed by the partial convolution. When c_p is set to c/4, the FLOPs of PConv become only 1/16 of those of a standard convolution, thereby significantly reducing the computational complexity. To endow the model with strong multi-scale capabilities and overcome the limitations of standard convolution in capturing diverse contextual information, we integrate an EMA module after the final 1 × 1 convolution in EPNet to aggregate multi-scale context. The architecture is illustrated in Fig. 3.
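As a concrete illustration, the channel-selective behavior of PConv and the FLOPs relation in Eq. (1) can be sketched in PyTorch. This is a minimal sketch following the FasterNet formulation; the class name and the split-and-concatenate layout are our assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution sketch: apply a k x k conv to the first c_p channels
    only, and pass the remaining channels through untouched."""
    def __init__(self, channels: int, ratio: float = 0.25, k: int = 3):
        super().__init__()
        self.c_p = int(channels * ratio)  # channels that get convolved
        self.conv = nn.Conv2d(self.c_p, self.c_p, k, padding=k // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

# FLOPs check for Eq. (1): with c_p = c/4, PConv costs 1/16 of a full conv
H, W, k, c = 80, 80, 3, 64
flops_full = H * W * k * k * c * c
flops_pconv = H * W * k * k * (c // 4) ** 2
assert flops_pconv * 16 == flops_full
```

With ratio = 0.25 the module convolves only a quarter of the channels, which is where the 1/16 FLOPs reduction in the text comes from.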

Fig. 3.

Fig. 3

Efficient multi-scale attention (EMA) module architecture.

The EMA module processes the feature maps produced by the preceding 1 × 1 convolutions. The input features are partitioned into G distinct groups along the channel dimension. For each group, three parallel pathways are employed: two 1 × 1 convolution branches and one 3 × 3 convolution branch. The 1 × 1 branches combine directional average pooling with shared convolutions to generate cross-channel attention weights. The 3 × 3 branch captures local spatial context. The outputs from these paths are aggregated via element-wise multiplication followed by averaging, effectively fusing multi-scale information. Finally, a Sigmoid function is applied to produce the final attention weights, enabling adaptive optimization of features.
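For reference, a minimal PyTorch sketch of the EMA module along the lines of the publicly released EMA implementation is shown below; the group count and the cross-path aggregation details are assumptions and may differ from the authors' exact configuration:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient multi-scale attention sketch: grouped channels, a 1x1 path with
    directional (H/W) pooling, a 3x3 path, and cross-path attention weights."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.g = groups
        c = channels // groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)            # global pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, 1)              # shared 1x1 conv
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)   # local spatial context

    def forward(self, x):
        b, c, h, w = x.size()
        gx = x.reshape(b * self.g, -1, h, w)           # split into G groups
        # 1x1 path: directional average pooling + shared 1x1 convolution
        x_h = self.pool_h(gx)                                  # (bg, c/g, h, 1)
        x_w = self.pool_w(gx).permute(0, 1, 3, 2)              # (bg, c/g, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 path
        x2 = self.conv3x3(gx)
        # cross-path aggregation -> per-pixel attention weights
        a1 = self.softmax(self.agp(x1).reshape(b * self.g, -1, 1).permute(0, 2, 1))
        b2 = x2.reshape(b * self.g, c // self.g, -1)
        a2 = self.softmax(self.agp(x2).reshape(b * self.g, -1, 1).permute(0, 2, 1))
        b1 = x1.reshape(b * self.g, c // self.g, -1)
        weights = (torch.matmul(a1, b2) + torch.matmul(a2, b1))
        weights = weights.reshape(b * self.g, 1, h, w)
        return (gx * weights.sigmoid()).reshape(b, c, h, w)
```

The module is shape-preserving, so it can be dropped after any convolution in the backbone without changing downstream layer dimensions.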

In conclusion, the proposed EPNet backbone employs PConv for selective channel filtering, which substantially improves feature extraction efficiency and directly reduces the computational burden. Moreover, the multi-scale attention mechanism of EMA enhances the model’s ability to integrate contextual information at different scales, which is crucial for detecting targets of disparate sizes and for maintaining performance under complex conditions. The synergy of these components makes EPNet a lightweight yet powerful backbone network.

Hybrid encoder enhancement with dynamic upsampling

In the hybrid encoder of RT-DETR, feature upsampling typically employs nearest-neighbor interpolation. Although computationally simple, this method disregards smooth transitions between pixels and relies solely on a limited number of neighboring pixels for prediction, often resulting in pixelation artifacts and loss of fine-grained details. Particularly in complex power transmission corridor scenarios, detecting distant or moving objects may lead to the loss of fine details due to pixel distortion during detection. Therefore, we integrate the DySample upsampling method into the cross-scale feature fusion (CCFF) module to address challenges related to computational efficiency and detection performance (see Algorithm B in the Supplementary Information). DySample adaptively adjusts sampling information based on local content within the input feature map, thereby better preserving detailed structures during upsampling—especially beneficial for detecting external damage targets in complex scenes. The DySample upsampling procedure is illustrated in Fig. 4.

Fig. 4.

Fig. 4

Architecture of the dynamic upsampling (DySample) module.

Specifically, a set of sampling points S is first predicted, where each point encodes 2D spatial coordinates. Given an input feature map X ∈ R^(C×H×W), we apply the differentiable grid_sample operation23, yielding an upsampled feature map X′ ∈ R^(C×sH×sW). This process is formally defined as:

X′ = grid_sample(X, S)    (2)

The point sampling generation process is illustrated in Fig. 5. Given an upsampling factor s, a dynamic offset O ∈ R^(2s²×H×W) is generated via a linear layer combined with pixel shuffle24. This offset adjusts the sampling positions of pixels in the feature map, with its range modulated by both static and dynamic factors. To ensure training stability, a gating mechanism comprising a linear layer and a Sigmoid function restricts the offset magnitude within a learnable dynamic range. We adopt the default static scope factor and group number of Liu et al.14, which are empirically shown to balance detail preservation and computational efficiency. Subsequently, the sampling set S is defined as the sum of the dynamic offsets O and the original sampling grid G:

S = G + O    (3)

By adaptively determining sampling locations based on input feature map content, DySample more effectively mitigates the loss of image details and produces sharper feature representations. This content-adaptive capability, combined with its lightweight offset predictor, renders DySample a computationally efficient solution for enhancing feature resolution within encoders.
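The sampling-set construction of Eqs. (2) and (3) can be sketched as follows. This is a simplified PyTorch sketch: offsets are predicted by a 1 × 1 convolution (the "linear layer"), scaled by an assumed static factor of 0.25, and expressed directly in normalized grid coordinates; the grouped variant and the sigmoid gating are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySample(nn.Module):
    """Content-aware upsampling sketch: predict per-pixel sampling offsets with
    a 1x1 conv + pixel shuffle, then resample the input with grid_sample."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # 2 coordinates per output pixel, s^2 output pixels per input pixel
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        nn.init.zeros_(self.offset.weight)   # start from the plain grid
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # dynamic offsets O, rearranged to the upsampled resolution (Fig. 5)
        o = F.pixel_shuffle(self.offset(x) * 0.25, s)   # (b, 2, sh, sw)
        # original sampling grid G in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, s * h)
        xs = torch.linspace(-1, 1, s * w)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1).expand(b, -1, -1, -1)
        # S = G + O (Eq. 3), then differentiable resampling (Eq. 2)
        coords = grid + o.permute(0, 2, 3, 1)
        return F.grid_sample(x, coords, align_corners=True, mode="bilinear")
```

With zero-initialized offsets the module starts out equivalent to plain bilinear upsampling and learns content-dependent deviations during training.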

Fig. 5.

Fig. 5

Architecture of the point sampling generator.

Shape-aware bounding box regression

The conventional RT-DETR employs the GIoU loss function for bounding box regression. Although GIoU effectively penalizes non-overlap between predicted and ground-truth boxes, it lacks an explicit mechanism to directly constrain the aspect ratio and scale of predicted boxes to match those of ground-truth boxes. This limitation may reduce localization accuracy for irregularly shaped objects or objects with large scale variations, both of which are common among hazards in transmission corridor environments.

To introduce explicit geometric constraints into the learning process, we replace GIoU with the ShapeIoU loss function. ShapeIoU incorporates a penalty mechanism for shape and scale mismatches, thereby guiding the model to learn more accurate shape features during training. The complete loss function is defined as follows:

L_ShapeIoU = 1 − IoU + distance^shape + 0.5 × Ω^shape    (4)

where IoU denotes the actual intersection-over-union, while distance^shape and Ω^shape represent the distance loss term and the shape loss term, respectively. These terms are computed as:

distance^shape = hh × (x_c − x_c^gt)² / c² + ww × (y_c − y_c^gt)² / c²    (5)

Ω^shape = Σ_{t ∈ {w, h}} (1 − e^(−ω_t))^θ, where ω_w = hh × |w − w^gt| / max(w, w^gt) and ω_h = ww × |h − h^gt| / max(h, h^gt)    (6)

where (x_c, y_c) and (x_c^gt, y_c^gt) denote the center coordinates of the predicted and ground-truth bounding boxes, respectively, and c is the diagonal length of the smallest box enclosing both. ww and hh represent the weight coefficients in the horizontal and vertical directions, with values dependent on the shape of the ground-truth box: ww = 2(w^gt)^scale / ((w^gt)^scale + (h^gt)^scale) and hh = 2(h^gt)^scale / ((w^gt)^scale + (h^gt)^scale). ω_w and ω_h represent the relative width and height differences, and the parameter θ defines the weight of the shape cost and is typically set to 4 (refs. 25,26).
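A minimal reference implementation of the ShapeIoU loss described above, written for single (x1, y1, x2, y2) boxes; the function name and the choice scale = 0 (which makes ww = hh = 1) are our assumptions:

```python
import math

def shape_iou(box_p, box_gt, scale=0.0, theta=4.0):
    """ShapeIoU loss (Eqs. 4-6) for a single predicted/ground-truth box pair."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # plain IoU
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)
    # shape-dependent weights from the ground-truth box
    ww = 2 * gw**scale / (gw**scale + gh**scale)
    hh = 2 * gh**scale / (gw**scale + gh**scale)
    # distance term, normalized by the enclosing-box diagonal c^2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw**2 + ch**2
    dist = hh * ((px1 + px2 - gx1 - gx2) / 2) ** 2 / c2 \
         + ww * ((py1 + py2 - gy1 - gy2) / 2) ** 2 / c2
    # shape term: penalize relative width/height mismatch
    ow = hh * abs(pw - gw) / max(pw, gw)
    oh = ww * abs(ph - gh) / max(ph, gh)
    omega = (1 - math.exp(-ow)) ** theta + (1 - math.exp(-oh)) ** theta
    return 1 - iou + dist + 0.5 * omega

# identical boxes incur zero loss
assert abs(shape_iou((0, 0, 10, 10), (0, 0, 10, 10))) < 1e-9
```

For perfectly matching boxes all three terms vanish, and any center, width, or height mismatch increases the loss.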

This design aims to alleviate the loss of detailed shape information during the regression process, a challenge that is particularly prominent when detecting objects against complex background images. Specifically, the ShapeIoU loss function is explicitly designed to improve the localization accuracy of irregularly shaped external threats by modeling geometric properties.

Experimental results and discussion

Dataset

Due to the absence of publicly available image datasets for transmission corridor hazards, we introduce the HDTrans dataset, constructed by integrating web-sourced data with targeted manual collection. This approach ensures comprehensive coverage of hazard scenarios from multiple perspectives and under diverse natural conditions. The dataset includes critical hazard objects such as excavators, cranes, loaders, and tower cranes. To counteract overfitting caused by limited and imbalanced samples, we implemented an extensive data augmentation pipeline incorporating Gaussian noise injection, adaptive brightness adjustment, Gaussian blurring, and affine transformations27. This process expanded the original 2,104 images to a robust dataset of 10,150 samples. The detailed composition and augmentation workflow of the HDTrans dataset are visualized in Fig. 6.

Fig. 6.

Fig. 6

Composition and augmentation strategy of the HDTrans dataset for transmission corridor hazard detection.

To evaluate model performance before and after improvements, we divided both the original and augmented datasets into training, validation, and test sets with an 8:1:1 ratio.
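The 8:1:1 partition can be reproduced with a simple shuffled split; the function name and seed are illustrative:

```python
import random

def split_811(items, seed=0):
    """Shuffle and split a sample list into train/val/test at an 8:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = n * 8 // 10, n // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_811(range(10150))
assert (len(train), len(val), len(test)) == (8120, 1015, 1015)
```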

Experimental configuration and training dynamics

All experiments were conducted under a unified hardware and software environment, utilizing an NVIDIA RTX 3080Ti GPU (12 GB memory) and an Intel Xeon Gold 6430 CPU, with PyTorch 2.1.0 and CUDA 12.1.

To objectively assess the comprehensive detection performance of different algorithms, we employed precision (P, %), recall (R, %), mean Average Precision (mAP, %), Frames Per Second (FPS, frames/s), number of parameters (Params, M), and model size (Model size, MB) as evaluation metrics for the models9. P, R, and mAP measure detection accuracy, FPS reflects real-time performance, while Params and Model size assess computational complexity and deployment cost.
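As a quick arithmetic check of the precision and recall definitions, the sketch below uses hypothetical detection counts chosen to reproduce the reported P = 91.4% and R = 85.1%; the counts themselves are illustrative, not from the paper:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true-positive, false-positive, and
    false-negative detection counts."""
    p = tp / (tp + fp)   # fraction of predictions that are correct
    r = tp / (tp + fn)   # fraction of ground-truth objects that are found
    return p, r

# hypothetical counts: 914 correct detections, 86 false alarms, 160 misses
p, r = precision_recall(tp=914, fp=86, fn=160)
assert round(p * 100, 1) == 91.4
assert round(r * 100, 1) == 85.1
```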

The model was trained for 300 epochs with a batch size of 32. The input image size was set to 640 × 640 pixels, and the initial learning rate was configured as 1 × 10⁻⁴. As illustrated in Fig. 7, the bounding box regression loss (GIoU loss), classification loss, and objectness loss (L1 loss) decreased steadily, indicating stable convergence. After 200 epochs, the model achieved satisfactory performance, with the mean average precision converging close to 1.0, which demonstrates effective optimization and robust learning behavior.

Fig. 7.

Fig. 7

Training dynamics: progression of loss functions and evaluation metrics throughout the optimization process.

Experiment using 5-fold cross-validation

To rigorously assess the model’s generalization capability, we conducted a comprehensive five-fold cross-validation study. In each fold, four subsets were used for training, while the remaining subset served as the test set. This process was repeated five times to ensure that every sample participated in validation. We recorded four key performance metrics across all folds: precision, recall, mAP@0.5, and inference speed. The calculated standard deviations were 0.1, 0.39, 0.19, and 0.83, respectively. The relatively larger standard deviation of recall indicates the model’s sensitivity to smaller and occluded targets, while the low standard deviations of precision and mAP@0.5 demonstrate classification stability. These statistical outcomes confirm that EDS-DETR achieves an optimal balance between detection comprehensiveness and operational stability through our architectural enhancements, as quantitatively evidenced in Fig. 8.
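The five-fold protocol, in which every sample is tested exactly once, can be sketched with plain index arithmetic:

```python
def kfold_indices(n: int, k: int = 5):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    each fold holds out one contiguous subset as the test set."""
    idx = list(range(n))
    fold = n // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        yield train, test

folds = list(kfold_indices(10150, 5))
assert len(folds) == 5
assert all(len(tr) + len(te) == 10150 for tr, te in folds)
```

In practice the indices would be shuffled once before folding; per-fold metrics (P, R, mAP@0.5, FPS) are then aggregated into the means and standard deviations reported above.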

Fig. 8.

Fig. 8

Five-fold cross-validation results demonstrating model generalization performance and metric stability across different data partitions.

Ablation experiments

To comprehensively evaluate each module’s contribution to hazard detection in transmission corridors, we sequentially integrated components into the RT-DETR model under controlled experimental conditions. As detailed in Table 1, the ablation study employs six core metrics: P, R, mAP@0.5, Params, Model size, and FPS, with “✓” indicating module activation.

Table 1.

Ablation study on the contribution of individual components.

Model   PConv  EMA  DySample  ShapeIoU  Params (M)  Size (MB)  P (%)  R (%)  mAP@0.5 (%)  FPS
Base    –      –    –         –         19.87       38.6       87.9   80.4   87.9         141
A       ✓      –    –         –         16.96       32.7       87.6   81.3   87.4         135
B       –      ✓    –         –         20.15       38.8       88.9   82.1   90.0         127
C       –      –    ✓         –         19.88       38.6       91.2   84.7   90.3         165
D       –      –    –         ✓         19.88       38.6       91.1   83.5   89.9         183
A+B     ✓      ✓    –         –         17.20       32.9       90.6   83.8   91.3         131
A+B+C   ✓      ✓    ✓         –         17.20       32.9       90.9   84.5   92.1         175
Ours    ✓      ✓    ✓         ✓         17.20       32.9       91.4   85.1   93.1         190

From the ablation results, it is evident that each module progressively enhances detection capability. Specifically, the baseline model achieves an mAP of 87.9% with 19.87M parameters. Model A is the RT-DETR model improved by PConv, which reduces parameters by 14.6% and model size by 15.3%, achieving a significant complexity reduction despite a 0.5% decrease in mAP@0.5. Model B introduces the EMA mechanism into conventional convolutional blocks, which enhances feature extraction and improves mAP@0.5 by 2.1% with a slight parameter increase. This demonstrates that improvements to the backbone network can augment the model’s feature extraction capability and thereby its detection accuracy. Model C implements DySample, which significantly improves precision and recall and yields a small gain in mAP@0.5, with only a marginal parameter increase. Model D replaces the GIoU loss function for localization regression with ShapeIoU, which accelerates convergence and improves mAP@0.5 without architectural changes. The synergistic combination of PConv and EMA (A+B) increases mAP@0.5 by 3.4% while reducing parameters by 2.67M, demonstrating EPNet’s dual advantages in accuracy and efficiency. With all optimization strategies applied, our model achieves the best performance at 17.2M parameters, with P, R, and mAP@0.5 increasing by 3.5%, 4.7%, and 5.2%, respectively, compared with the benchmark network, while also improving detection speed (190 vs. 141 FPS).

In addition, the significant FPS improvement over RT-DETR-R18 (190 vs. 141) is primarily attributed to the parameter reduction achieved by integrating PConv into the backbone network, which reduces computational redundancy while maintaining representational capacity. DySample contributes to the speedup indirectly by enabling the use of fewer parameters to achieve comparable feature representation quality, though its direct computational overhead is minimal due to its efficient point sampling design. All FPS measurements were conducted under identical hardware (NVIDIA RTX 3080Ti) and software (PyTorch 2.1.0, CUDA 12.1) environments to ensure comparability.
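FPS measurements of this kind are typically taken with warm-up iterations excluded from timing; a framework-agnostic sketch is shown below (the function name is ours, and on a GPU one would additionally synchronize, e.g. with torch.cuda.synchronize(), before reading the clock):

```python
import time

def measure_fps(infer, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Average frames-per-second of a no-argument inference callable."""
    for _ in range(n_warmup):       # warm-up runs are excluded from timing
        infer()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        infer()
    return n_runs / (time.perf_counter() - t0)

# stand-in workload; in the paper's setting `infer` would run one forward pass
fps = measure_fps(lambda: sum(range(1000)))
assert fps > 0
```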

This ablation study not only validates each module’s efficacy but also confirms that the proposed algorithm can simultaneously meet the lightweight and high-accuracy requirements of external hazard detection.

Ablation study on architectural components

To evaluate the advantages of different modules, we selected various upsampling techniques, attention mechanisms, and loss functions under identical experimental conditions. Table 2 summarizes the detection performance of the functional modules in our proposed EDS-DETR on the HDTrans dataset. Specifically, the attention mechanisms include SENetV128, SENetV229, LSKA30, and Triple31. For the upsampling module, comparisons include CARAFE32. Regarding loss functions, comparisons include MPDIoU33 and Focaler-IoU34.

Table 2.

Ablation Study of Architectural Components on Baseline Performance.

Configuration Component Params (M) P (%) R (%) mAP@0.5 (%) FPS
Baseline 19.87 87.9 80.4 87.9 141
+ Attention SENetV1 19.97 86.7 82.5 88.3 130
SENetV2 20.23 88.7 80.9 88.7 133
LSKA 20.61 88.4 80.8 88.6 115
Triple 19.89 87.1 83.1 88.8 108
EMA 20.15 88.9 82.1 90.0 127
+ Upsampling CARAFE 19.58 91.0 82.5 91.2 135
DySample 19.88 91.2 84.7 90.3 165
+ IoU MPDIoU 19.88 90.9 80.3 89.7 190
SIoU 19.88 88.9 84.5 89.7 168
ShapeIoU 19.88 91.1 83.5 89.9 183

The comparison results in Table 2 indicate that the proposed method achieves superior overall performance on the test set compared to other models.

Specifically, in the ablation study on attention mechanisms shown in Table 2, the EMA module demonstrates superior performance among the evaluated attention variants. With only 20.15M parameters, EMA achieves the highest mAP@0.5 of 90.0%, outperforming SENetV2 (88.7%) and LSKA (88.6%). This improvement can be attributed to EMA’s effective cross-scale feature integration capability, which is particularly beneficial for multi-scale object detection in transmission corridor environments. While LSKA may introduce background noise and SENetV2 overlooks spatial dependencies, EMA maintains a better balance between precision (88.9%) and recall (82.1%). Furthermore, EMA maintains efficient inference at 127 FPS, comparable to the baseline (141 FPS) and superior to other attention mechanisms like Triple Attention (108 FPS).

In the ablation study on upsampling modules presented in Table 2, the DySample module achieves the best trade-off between accuracy and efficiency, attaining 90.3% mAP@0.5 with 165 FPS—significantly higher than CARAFE (135 FPS) and the baseline upsampling approach (141 FPS). Compared to CARAFE, DySample improves recall by 2.2 percentage points (84.7% vs. 82.5%) while maintaining comparable precision (91.2% vs. 91.0%). This demonstrates DySample’s effectiveness in preserving feature integrity, particularly for distant or small objects in transmission corridor imagery.

In the ablation study on loss functions shown in Table 2, ShapeIoU achieves the highest mAP@0.5 (89.9%) among the evaluated losses, improving over MPDIoU (89.7%) and SIoU (89.7%). Specifically, ShapeIoU enhances recall by 3.2 percentage points compared to MPDIoU (83.5% vs. 80.3%), effectively reducing missed detections for complex-shaped mechanical hazards along power transmission lines. Despite the accuracy improvements, all IoU variants maintain high inference speeds exceeding 168 FPS, ensuring real-time detection capability.

Comparison experiments

To assess the performance of the EDS-DETR model, we compared the proposed algorithm with Faster R-CNN, the YOLO series, and the original RT-DETR algorithm. All comparative experiments were performed under identical datasets and configurations, with a primary focus on performance metrics (P, R, mAP@0.5, mAP@0.5:0.95), model complexity (Params), and detection speed (FPS).

As shown in Table 3, our model achieves the best precision, recall, mAP@0.5, and mAP@0.5:0.95 among all compared models.

Table 3. Performance comparison with state-of-the-art object detection models.

Model Params (M) P (%) R (%) mAP@0.5 (%) mAP@0.5:0.95 (%) FPS
Faster R-CNN35 41.4 54.4 46.8 74.7 59.4 38.6
SSD36 34.6 60.3 55.8 80.4 65.7 72.4
YOLOv8-s37 11.1 83.2 75.9 86.3 70.1 229.1
YOLOv9-C38 25.3 83.7 72.3 84.7 69.2 217.9
YOLOv10-S39 8.04 82.5 76.1 86.5 70.1 176.2
DETR-R5016 41.0 85.5 83.9 83.2 65.3 86.3
Deformable DETR17 40.0 80.1 84.2 84.5 68.2 55.9
RT-DETR-R1819 19.87 87.9 80.4 87.9 70.3 141.0
Ours 17.2 91.4 85.1 93.1 73.1 190.0

Comparative Analysis with Different Architectures: Experimental results highlight the inherent limitations of different network architectures in this task. The two-stage detector Faster R-CNN achieved the lowest accuracy and speed, with its performance constrained by the computationally expensive region proposal network and its relatively poor robustness to extreme scale variations. Among single-stage detectors, the convolution-based YOLO series demonstrated excellent inference speed (176–229 FPS) owing to their highly optimized architectures. However, their reliance on convolutional backbones with limited receptive fields restricts their ability to model global context in complex scenes, resulting in lower accuracy than our method. This highlights the difficulty pure convolutional architectures face in achieving both top-tier speed and high accuracy.

Comparative Analysis with Transformer Models: Transformer-based models, i.e., DETR and Deformable DETR, achieve detection performance comparable to the YOLO series but suffer from high model complexity and inferior detection speed. In contrast, RT-DETR strikes a better balance between complexity and performance, making it more suitable for detecting external hazards in transmission corridors. Taking RT-DETR as the benchmark, our improved method achieves an mAP@0.5 of 93.1% with only 17.2M parameters while running at 190 FPS, demonstrating clear advantages in accuracy, compactness, and speed.
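As a quick sanity check, the relative gains over the RT-DETR-R18 baseline can be recomputed directly from the Table 3 entries with a throwaway Python snippet (the 14.8% model-size figure quoted in the abstract is on-disk size, which is not derivable from the table and is not reproduced here):

```python
# Values copied from Table 3 (RT-DETR-R18 baseline vs. EDS-DETR).
baseline = {"params": 19.87, "map50": 87.9, "recall": 80.4, "fps": 141.0}
ours = {"params": 17.2, "map50": 93.1, "recall": 85.1, "fps": 190.0}

param_reduction = (baseline["params"] - ours["params"]) / baseline["params"] * 100
map_gain = ours["map50"] - baseline["map50"]
recall_gain = ours["recall"] - baseline["recall"]

print(f"{param_reduction:.1f}% fewer parameters")  # 13.4% fewer parameters
print(f"+{map_gain:.1f} points mAP@0.5")           # +5.2 points mAP@0.5
print(f"+{recall_gain:.1f} points recall")         # +4.7 points recall
```

These recomputed figures match the 13.4% parameter reduction and the 5.2- and 4.7-point improvements reported in the abstract and conclusions.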

The above experiments indicate that our model outperforms current mainstream object detectors across the main accuracy metrics while maintaining a low parameter count. Although it does not achieve the highest FPS, its 190 FPS throughput comfortably meets real-time requirements, demonstrating its potential to advance external-hazard detection in transmission corridors.

To provide a more intuitive comparison of the detection effects before and after improving the RT-DETR model, we selected images from four different scenarios in the test set for a comparative object detection experiment. In Scenario 1 (Fig. 9(a, e, i)), RT-DETR fails to detect small-scale objects, while EDS-DETR achieves accurate detection, thanks to the multi-scale attention in EMA and the detail preservation of DySample. In Scenario 2 (Fig. 9(b, f, j)), RT-DETR produces false positives in cluttered backgrounds, whereas EDS-DETR correctly rejects distractors due to the geometric constraints of ShapeIoU. Scenarios 3 and 4 (Fig. 9(c, g, k) and Fig. 9(d, h, l)) show EDS-DETR's consistently superior performance, with higher confidence scores and more accurate localization. These visual results corroborate the quantitative metrics, validating the effectiveness of EDS-DETR on the core domain-specific challenges.

Fig. 9. Qualitative comparison of detection results on challenging scenes.

Conclusions and future work

This paper proposes a lightweight model named EDS-DETR to address the challenges of accuracy and efficiency in external damage detection within transmission corridors. To achieve a compact model design, Partial Convolution (PConv) is introduced into the ResNet18 backbone, significantly reducing model size and parameter count. To enhance feature representation, the Efficient Multi-scale Attention (EMA) module is integrated with PConv, and the DySample upsampler is adopted within the encoder. Experimental results demonstrate that these strategies work synergistically, substantially improving detection performance. Furthermore, the adoption of the Shape-IoU loss function effectively accelerates convergence and improves localization accuracy. On the HDTrans dataset constructed for this study, EDS-DETR achieves a recall of 85.1% and an mAP@0.5 of 93.1%, improvements of 4.7 and 5.2 percentage points, respectively, over the baseline model. The proposed model also outperforms other mainstream detectors, surpassing the standard YOLOv5s model by approximately 3% in mAP while operating at a higher frame rate. More importantly, the model maintains a high inference speed of 190 FPS, demonstrating its suitability for real-time deployment scenarios.

Although the results are encouraging, this study has certain limitations. The model was primarily trained and validated on a custom dataset, which may not encompass all environmental variations encountered in practical applications. Future work will focus on expanding the dataset to include more challenging scenarios, such as extreme weather conditions and occlusions. We also plan to explore extending the EDS-DETR model for video-based damage tracking tasks. In summary, this research provides a reliable and efficient method for transmission line inspection, establishing a solid foundation for subsequent studies in the field of smart grid maintenance.


Author contributions

He Su: Conceptualization, Methodology, Writing – original draft. Jiaomin Liu, Zhenzhou Wang: Resources, Supervision, Validation, Formal analysis. Pingping Yu: Writing – review and editing.

Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-27403-0.

References

  • 1.Sharma, P., Saurav, S. & Singh, S. Object detection in power line infrastructure: A review of the challenges and solutions. Eng. Appl. Artif. Intell.130, 107781 (2024). [Google Scholar]
  • 2.Jinghan, H. et al. A research review on application of artificial intelligence in power system fault analysis and location. Proc. CSEE40, 5506–5516 (2020). [Google Scholar]
  • 3.Liu, M., Li, Y., Chen, Y., Qi, Y. & Jin, L. A distributed competitive and collaborative coordination for multirobot systems. IEEE Transactions on Mob. Comput.23, 11436–11448 (2024). [Google Scholar]
  • 4.Liu, J., Huang, H., Zhang, Y., Lou, J. & He, J. Deep learning based external-force-damage detection for power transmission line. In Journal of Physics: Conference Series, 1169, 012032 (IOP Publishing, 2019).
  • 5.Qu, L., Liu, K., He, Q., Tang, J. & Liang, D. External damage risk detection of transmission lines using e-ohem enhanced faster r-cnn. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), 260–271 (Springer, 2018).
  • 6.Leng, X., Dai, J., Gao, Y., Xu, A. & Jin, G. Overhead transmission line anti-external force damage system. In IOP Conference Series: Earth and Environmental Science, 1044, 012006 (IOP Publishing, 2022).
  • 7.Yu, P., Wang, H., Zhao, X. & Ruan, G. An algorithm for target detection of engineering vehicles based on improved centernet. Comput. Mater. & Continua 73 (2022).
  • 8.Zou, H. et al. Detection method of external damage hazards in transmission line corridors based on yolo-lsdw. Energies17, 4483 (2024). [Google Scholar]
  • 9.Yu, P., Yan, Y., Tang, X., Shang, Y. & Su, H. A lightweight cer-yolov5s algorithm for detection of construction vehicles at power transmission lines. Appl. Sci.14, 6662 (2024). [Google Scholar]
  • 10.Liu, M. et al. A content-aware method for detecting external-force-damage objects on transmission lines. Electronics14, 715 (2025). [Google Scholar]
  • 11.Fan, J. et al. Coevolutionary neural dynamics considering multiple strategies for nonconvex optimization. Tsinghua Sci. Technol. 2025, 9010120 (2025). [Google Scholar]
  • 12.Ouyang, D. et al. Efficient multi-scale attention module with cross-spatial learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
  • 13.Chen, J. et al. Run, don’t walk: chasing higher flops for faster neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12021–12031 (2023).
  • 14.Liu, W., Lu, H., Fu, H. & Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6027–6037 (2023).
  • 15.Vaswani, A. et al. Attention is all you need. Adv. neural information processing systems 30 (2017).
  • 16.Carion, N. et al. End-to-end object detection with transformers. In European conference on computer vision, 213–229 (Springer, 2020).
  • 17.Zhu, X. et al. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020).
  • 18.Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
  • 19.Zhao, Y. et al. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 16965–16974 (2024).
  • 20.Liu, M., Wang, H., Du, L., Ji, F. & Zhang, M. Bearing-detr: A lightweight deep learning model for bearing defect detection based on rt-detr. Sensors24, 4262 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Rezatofighi, H. et al. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 658–666 (2019).
  • 22.Liu, M., Chen, L., Du, X., Jin, L. & Shang, M. Activated gradients for deep neural networks. IEEE Transactions on Neural Networks Learn. Syst.34, 2156–2168 (2021). [DOI] [PubMed] [Google Scholar]
  • 23.Jaderberg, M., Simonyan, K., Zisserman, A. et al. Spatial transformer networks. Adv. neural information processing systems28 (2015).
  • 24.Shi, W. et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1874–1883 (2016).
  • 25.Wang, D. et al. Sds-yolo: An improved vibratory position detection algorithm based on yolov11. Measurement244, 116518 (2025). [Google Scholar]
  • 26.Liu, S. et al. Binocular localization method for pear-picking robots based on yolo-cds and rsiqr module collaborative optimization. Comput. Electron. Agric.239, 110962 (2025). [Google Scholar]
  • 27.Xu, M., Yoon, S., Fuentes, A. & Park, D. S. A comprehensive survey of image augmentation techniques for deep learning. Pattern Recognit.137, 109347 (2023). [Google Scholar]
  • 28.Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7132–7141 (2018).
  • 29.Narayanan, M. Senetv2: Aggregated dense layer for channelwise and global representations. arXiv preprint arXiv:2311.10807 (2023).
  • 30.Lau, K. W., Po, L.-M. & Rehman, Y. A. U. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert. Syst. with Appl.236, 121352 (2024). [Google Scholar]
  • 31.Misra, D., Nalamada, T., Arasanipalai, A. U. & Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 3139–3148 (2021).
  • 32.Wang, J. et al. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF international conference on computer vision, 3007–3016 (2019).
  • 33.Ma, S. & Xu, Y. Mpdiou: a loss for efficient and accurate bounding box regression. arXiv preprint arXiv:2307.07662 (2023).
  • 34.Gevorgyan, Z. Siou loss: More powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740 (2022).
  • 35.Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence39, 1137–1149 (2016). [DOI] [PubMed] [Google Scholar]
  • 36.Liu, W. et al. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, 21–37 (Springer, 2016).
  • 37.Jocher, G., Chaurasia, A. & Qiu, J. Ultralytics yolov8 (2023).
  • 38.Wang, C.-Y., Yeh, I.-H. & Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In European conference on computer vision, 1–21 (Springer, 2024).
  • 39.Wang, A. et al. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst.37, 107984–108011 (2024). [Google Scholar]
