Scientific Reports. 2025 Sep 1;15:32060. doi: 10.1038/s41598-025-17186-9

YOLOv8-MCDE for lightweight detection of small instruments in complex backgrounds from inspection robots’ perspective

Tongxin Yang 1, Ding Ling 1, Qingwu Shi 1, Tianyue Jiang 1
PMCID: PMC12402103  PMID: 40890262

Abstract

This paper addresses the challenges of equipment inspection in complex substation environments by proposing a lightweight small object detection algorithm, YOLOv8-MCDE, specifically designed for instrument recognition and suitable for deployment on inspection robots. Through model structure optimization, the proposed method significantly enhances both the small object detection performance and real-time efficiency of instrument detection on edge computing devices. YOLOv8-MCDE adopts the lightweight MobileNetV3 architecture as its backbone, effectively reducing model complexity and improving operational efficiency. The neck integrates a CNN-based Cross-scale Feature Fusion (CCFF) algorithm, which further lowers computational overhead while enhancing detection capability for small objects. In addition, a Deformable Large Kernel Attention (D-LKA) mechanism is integrated to increase the model’s sensitivity to small objects within complex backgrounds. The conventional CIOU loss function is also replaced with the more efficient EIOU loss function, significantly improving bounding box localization accuracy and accelerating model convergence. Experimental results demonstrate that YOLOv8-MCDE achieves a Precision of 92.80% and an mAP50 of 91.36%, representing improvements of 2.38% and 1.27%, respectively, compared to the original YOLOv8. Furthermore, the proposed algorithm reduces FLOPs by 37.68% and model size by 36%. These enhancements substantially reduce computational resource demands while significantly improving the real-time detection capabilities and small object recognition performance of inspection robots operating in complex environments.

Keywords: Lightweight small object detection, YOLOv8, CNN-based Cross-scale Feature Fusion, Deformable Large Kernel Attention mechanism, EIOU

Subject terms: Electrical and electronic engineering, Mechanical engineering

Introduction

Instruments play a critical role in substations and other industrial environments by enabling real-time monitoring of equipment operational status, thus ensuring system safety and operational efficiency. In particular, pointer meters, due to their mechanical robustness, can still provide accurate measurements even under conditions of strong electromagnetic interference. However, current methods for reading pointer meters primarily rely on manual observation and recording, which is time-consuming, lacks real-time responsiveness, and is susceptible to errors caused by visual fatigue and distraction. These limitations significantly hinder the growing automation demands of industrial production. Therefore, developing automatic detection technologies suitable for practical industrial settings, capable of accurately and promptly reading meter information, has become an essential research direction in recent years1–3.

With the rapid advancement of machine vision and deep learning technologies, instrument detection based on image processing has gradually become a research hotspot. However, deploying these technologies on inspection robots poses numerous practical challenges4. Inspection robots typically operate in realistic substation environments, where the captured images are complicated by factors such as complex backgrounds, variable illumination conditions, reflections, and surface contamination. Additionally, due to varying camera angles and positions, instruments often appear small, tilted, or rotated within the images, significantly increasing the detection difficulty. Existing research mainly relies on datasets collected manually, characterized by relatively simple backgrounds and a large proportion of targets within the images, which fail to accurately represent the complex scenarios and small-object detection demands encountered in real-world inspections. Moreover, although some high-precision two-stage object detection methods achieve promising results, their large parameter count and slow inference speed prevent them from meeting the real-time and lightweight deployment requirements of edge devices. Even single-stage detection methods, such as the YOLO series, still have substantial room for improvement in accuracy and efficiency when handling small-object detection tasks under limited computational resources.

To address the aforementioned challenges, this paper proposes YOLOv8-MCDE – a lightweight and efficient small object detection algorithm tailored for complex environments encountered by inspection robots. The proposed method is designed to tackle real-world industrial issues such as the small size of instrument targets, strong background interference, and limited computational resources on edge devices.

Related work

Early instrument panel detection methods primarily relied on traditional image processing techniques. Zheng et al.5 extracted instrument regions by applying image binarization and contour analysis, followed by perspective transformation to correct tilted images, resulting in approximately frontal dial views suitable for subsequent analysis. However, these methods suffer from poor generalization, with detection accuracy degrading significantly under variations in instrument type, background, or camera angle. In addition, traditional approaches typically involve multiple independently hand-crafted steps, resulting in a complex pipeline that is difficult to integrate and maintain. Consequently, they fall short of meeting the real-time performance and maintainability demands of modern intelligent inspection systems.

In recent years, deep learning methods have become the mainstream approach in the field of instrument detection. Object detection algorithms based on convolutional neural networks have demonstrated significantly better performance compared to traditional methods. Among these, two-stage detection frameworks have gained considerable attention for their high accuracy. Liu et al.6 utilized Faster R-CNN to detect target instruments in captured images. Zuo et al.7 employed an improved version of Mask R-CNN to segment dial regions. In their approach, a CNN was first used for feature extraction, followed by a Region Proposal Network (RPN) to generate candidate regions likely to contain dials. Precise Region of Interest Pooling (PrRoIPooling) was then adopted in place of the original RoIAlign to obtain more accurate spatial feature representations. Finally, the model produced pixel-level binary masks of the dial areas, achieving precise localization and segmentation. However, despite their superior recognition accuracy, such two-stage algorithms are computationally intensive and exhibit slower inference speeds. Their computational complexity is typically about 20 times greater than that of one-stage algorithms, making them impractical for real-time deployment on resource-constrained edge devices.

In contrast, one-stage object detection algorithms demonstrate superior inference efficiency and lightweight design, making them more suitable for real-time applications in industrial environments8. Salomon et al.1 compared Faster R-CNN with the YOLO series of one-stage detectors and found that, although Faster R-CNN achieved slightly higher detection accuracy, it suffered from significantly higher computational complexity and slower inference speed. In comparison, the YOLO series offered a better trade-off between speed and accuracy, making it more practical for deployment on resource-constrained edge devices. Zhang et al.9 proposed a pointer reading method for water meters based on target–keypoint detection, employing an improved YOLOv4-Tiny network to detect and extract dial regions from images. This method effectively addressed challenges such as varying lighting conditions and diverse camera angles. Zhou et al.10 utilized the YOLOv5 network to localize instruments directly from full-frame images, leveraging the model’s ability to automatically extract deep features and generate bounding boxes that effectively distinguish instruments from the background. Xu et al.11 introduced PMD-YOLO, a lightweight model that integrates GhostNet and the Convolutional Block Attention Module (CBAM) to reduce model complexity. Additionally, they incorporated an enhanced Receptive Field Block (RFB) and a bidirectional Feature Pyramid Network (FPN) to improve detection performance for small and medium-sized targets. Hou et al.12 proposed a dial detection method based on the YOLOX framework, which utilizes an anchor-free detection mechanism. The backbone network extracts initial features, which are then fused across multiple scales using FPN. A decoupled prediction head is subsequently applied to generate feature maps containing bounding box information, enabling accurate dial localization. Jin et al.13 further developed the YOLOv8-o algorithm, which enhances detection robustness for tilted and rotated targets, thereby improving the accuracy of instrument localization and correction in real-world scenarios. Despite these advancements, existing studies still present several limitations. Most datasets used in prior work are manually collected, with large instrument targets and relatively simple backgrounds, which fail to accurately simulate the complex conditions faced by inspection robots, such as variable viewing angles, challenging lighting, strong background interference, and the presence of small targets. Furthermore, many studies primarily focus on detection success rates while overlooking the importance of bounding box localization accuracy, which directly affects the precision of subsequent perspective correction and reading calculations. In practice, precise localization can significantly enhance the accuracy of dial rectification and reduce the final reading error.

In summary, inspection robots tasked with automated instrument recognition in complex industrial environments face several critical challenges, including small target sizes, cluttered backgrounds, variable lighting conditions, and limited computational resources. To overcome these issues, this paper proposes YOLOv8-MCDE, a lightweight and accurate detection algorithm that achieves an effective balance between model precision and efficiency. The key scientific novelties and contributions of this work are as follows:

(1) To reduce the overall computational complexity and enable efficient deployment on edge devices, this paper adopts MobileNetV3 as the backbone of YOLOv8. MobileNetV3 integrates hardware-friendly MobileNet blocks, squeeze-and-excitation (SE) attention modules, and an improved nonlinear activation function (h-swish). This design significantly reduces the number of parameters and inference time while maintaining strong feature extraction capabilities, thereby enhancing the model’s adaptability to resource-constrained environments.

(2) To address the limitations of traditional FPN structures, which often lose spatial information when detecting small objects, we incorporate a CNN-based Cross-scale Feature Fusion (CCFF) algorithm into the neck of the network. This module facilitates bidirectional information exchange across different feature scales through a multi-branch convolutional design and lightweight lateral fusion strategy. As a result, it improves the complementarity between low-level semantic details and high-level contextual features, effectively enhancing recall for small instrument targets.

(3) In complex backgrounds, conventional convolutional operations typically suffer from limited receptive fields, hindering their ability to capture long-range dependencies. To mitigate this, we introduce a Deformable Large Kernel Attention (D-LKA) mechanism at key locations within the neck. By combining large-kernel convolutions with deformable convolution operations, this module expands the receptive field and adaptively focuses on irregular spatial regions. This improves the model’s robustness to target deformation, positional variation, and background noise in real-world inspection scenarios.

(4) We further incorporate an Efficient Intersection over Union (EIOU) loss function, which builds upon traditional IOU by explicitly modeling the center distance and the width-height discrepancies between predicted and ground-truth boxes. Compared with the original YOLOv8 loss function, EIOU offers faster convergence and higher localization accuracy while maintaining computational efficiency, making it particularly suitable for small object detection tasks in industrial applications.

Materials and methods

Dataset

Training uses the Meter Challenge (MC1296) dataset, which includes 1,296 images captured by inspection robots14. To enable the model to better adapt to substation industrial environments, the dataset includes complex backgrounds, multiple scales, various perspectives, and different types of instruments, as shown in Fig. 1.

Fig. 1. Example of dataset images.

In addition, to facilitate corrections after instrument panel detection, the original dataset was re-annotated to ensure that detection boxes closely align with the contours of the dials. This revision also addresses issues of incorrect and missing labels in the initial annotations, with the updated labels displayed in Fig. 2.

Fig. 2. Revised annotations.

To further enhance the localization accuracy of instrument panel detection, the original MC1296 dataset was expanded by adding 262 manually captured images of pointer instruments and 40 digital instrument images sourced from Roboflow, resulting in a total of 1,674 images, as shown in Fig. 3.

Fig. 3. The captured dataset and the dataset from Roboflow.

Simultaneously, to increase the model’s adaptability to real-world scenarios and its detection accuracy, random data augmentation was applied to these 1,674 images, as illustrated in Fig. 4. The augmentations included brightness adjustment, flipping, rotation, and salt-and-pepper noise. These augmentations introduced greater variation in lighting conditions and imaging angles, increasing the dataset size from 1,674 to 5,673 images. This effectively reduced the risk of overfitting and enhanced the model’s generalization capabilities. Experimental results presented in Table 1 indicate significant improvements across various evaluation metrics after integrating these additional datasets, highlighting their critical role in boosting the model’s detection and localization performance for instrument panels.

Fig. 4. Augmented images.
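As a concrete illustration of the augmentation pipeline described above, the following minimal sketch applies one of the four transformations (brightness adjustment, flipping, rotation, salt-and-pepper noise) to an image. The parameter ranges and function names are illustrative assumptions rather than the exact values used in this work, and box labels would need the corresponding geometric transform when flipping or rotating.

```python
import cv2
import numpy as np

def augment(image, rng=None):
    """Apply one randomly chosen augmentation of the four used in this work (illustrative)."""
    rng = rng or np.random.default_rng()
    choice = rng.integers(0, 4)
    if choice == 0:                                    # brightness adjustment
        factor = rng.uniform(0.6, 1.4)                 # assumed range
        image = np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)
    elif choice == 1:                                  # horizontal flip (boxes must be flipped too)
        image = cv2.flip(image, 1)
    elif choice == 2:                                  # rotation about the image centre
        h, w = image.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-15, 15), 1.0)
        image = cv2.warpAffine(image, m, (w, h))
    else:                                              # salt-and-pepper noise
        mask = rng.random(image.shape[:2])
        image = image.copy()
        image[mask < 0.01] = 0                         # pepper
        image[mask > 0.99] = 255                       # salt
    return image
```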

Table 1.

Comparison before and after expanding the dataset.

Dataset | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/%
MC1296 | 90.22 | 85.60 | 88.40 | 75.70
MC1296 + expanded dataset | 90.42 | 93.08 | 90.09 | 76.80

Standard YOLOv8 model

Since the introduction of YOLOv1 by Joseph Redmon in 2015, the YOLO series has gone through multiple iterations, continually improving detection speed and accuracy15–21. In 2023, Ultralytics released YOLOv8, which was considered the state-of-the-art (SOTA) detection algorithm of that year. YOLOv8 inherits the classical framework of its predecessors and comprises three main components: the Backbone for feature extraction, the Neck for feature fusion, and the Head for classification and localization22–24, as illustrated in Fig. 5.

Fig. 5. Structure of YOLOv8.

Although several newer versions of the YOLO algorithm have been released, the experimental results in the subsequent Comparison of Different Algorithm Types section demonstrate that YOLOv8 still delivers the best performance on our dataset. In this study, we adopt YOLOv8s, a lightweight variant of YOLOv8 that achieves an excellent balance between accuracy and computational efficiency for our application scenario.

The YOLOv8 backbone network, optimized based on CSPDarknet-53, efficiently extracts multi-scale features primarily through Conv, C2f, and SPPF modules. Specifically, the C2f module significantly enhances feature representation and optimizes gradient flow, while the SPPF module efficiently aggregates spatial features across various scales, simultaneously reducing computational complexity. In the neck component, YOLOv8 integrates Feature Pyramid Networks (FPN) and Path Aggregation Network (PANet) architectures, utilizing a bidirectional feature fusion strategy that combines top-down and bottom-up paths. This enhances the network’s detection capabilities for objects across different scales. The head module retains YOLO’s classic decoupled detection head structure, employing independent convolutional branches for classification and localization tasks, thereby improving detection accuracy and stability. Regarding the loss function, YOLOv8 comprehensively addresses classification, bounding box regression, and object confidence tasks. It uses Binary Cross Entropy (BCE) loss for evaluating class prediction accuracy and object confidence, and Complete Intersection over Union (CIOU) loss for assessing bounding box predictions. This integrated design enhances the training stability and inference efficiency of YOLOv8, achieving an effective balance between detection speed and accuracy.

Improved YOLOv8 model

Although YOLOv8 performs well in general object detection tasks, it still requires further optimization when applied to images captured by inspection robots. These images often contain complex backgrounds and small-sized instruments, increasing the difficulty of detection. Additionally, the original model has a large size and high computational complexity, which hinders real-time deployment on edge devices such as inspection robots in substations.

Therefore, we propose an improved, lightweight small-object detection algorithm based on YOLOv8, called YOLOv8-MCDE, whose architecture is shown in Fig. 6. Compared with the original YOLOv8, the modifications of YOLOv8-MCDE are illustrated in Fig. 7. The contributions are as follows:

Fig. 6. Structure of YOLOv8-MCDE.

Fig. 7. Comparison of YOLOv8 and YOLOv8-MCDE.

(1) The original backbone network of YOLOv8 is replaced with MobileNetV3, a lightweight neural network recognized for its superior accuracy and inference speed. MobileNetV3 leverages hardware-aware neural architecture search (NAS) and NetAdapt algorithms to optimize its inverted residual blocks, integrates lightweight Squeeze-and-Excitation (SE) attention modules, and employs the efficient Hard-Swish activation function. These enhancements significantly reduce model parameters and computational cost, ensuring effective feature extraction and suitability for rapid deployment on resource-constrained edge devices.

(2) The CCFF algorithm concept is incorporated into the neck structure of YOLOv8. CCFF employs a pyramid-based architecture to achieve bidirectional fusion of multi-scale feature maps, along with 1×1 convolutions to efficiently adjust feature channel dimensions and reduce computational overhead. This approach effectively preserves detailed information from low-level features, significantly enhancing the detection performance for small-scale objects. Consequently, the proposed method maintains a lightweight structure and improves detection accuracy simultaneously.

(3) A Deformable Large Kernel Attention (D-LKA) mechanism is incorporated after selected C2f layers in the neck of YOLOv8. This attention module first expands the receptive field using depth-wise convolution and depth-wise dilated convolution, and then introduces dynamically learned deformable sampling positions, allowing the network to adaptively focus on the most salient target features. By integrating this mechanism, the network’s ability to capture fine-grained features of small objects in complex backgrounds is significantly enhanced, and its responsiveness to targets of varying scales is greatly improved, thereby further increasing overall detection accuracy and robustness.

(4) The original CIOU loss function of YOLOv8 is replaced with the EIOU loss function, which explicitly rectifies deviations in width and height between predicted and ground-truth bounding boxes. By decoupling the width-height loss calculations and introducing the minimum enclosing rectangle as a normalization factor, the EIOU loss function significantly enhances the localization precision of bounding boxes and accelerates model convergence during training.

Backbone network optimization

We replaced YOLOv8’s feature extraction backbone with MobileNetV3, a lightweight model proposed by Google25. It aims to improve deep learning efficiency on mobile devices by optimizing network structures through hardware-aware neural architecture search, introducing the efficient h-swish activation function, and incorporating the lightweight squeeze-and-excitation (SE) attention mechanism.

Specifically, MobileNetV3 adopts the Bottleneck module as its fundamental building block. Building upon the classic inverted residual structure of MobileNetV2, MobileNetV3 further integrates the lightweight Squeeze-and-Excitation (SE) attention module and efficient h-swish activation function to enhance feature extraction capability while reducing computational overhead. The Bottleneck structure of MobileNetV3 is illustrated in Fig. 8, where the SE module first performs global average pooling, followed by a fully-connected layer to generate a nonlinear transformation for channel-wise weight distribution. Finally, these weights are applied to the original feature maps to recalibrate the channel responses. The original swish activation function26 is defined as:

swish(x) = x · σ(x)  (1)

where σ denotes the sigmoid function; this corresponds to the soft version of swish.

Fig. 8. MobileNetV3 bottleneck structure.

Although this nonlinear activation function improves accuracy, the sigmoid function has high computational overhead on mobile devices, making it difficult to apply in embedded environments. Therefore, the sigmoid function is replaced by the hard-sigmoid activation function, defined as follows:

hard-sigmoid(x) = ReLU6(x + 3) / 6  (2)

Substituting the hard-sigmoid into swish yields hard-swish:

hard-swish(x) = x · ReLU6(x + 3) / 6  (3)

This function is the hard version of swish, implemented for better computational efficiency. Experiments show that the hard version of swish is similar in accuracy to the soft version, but offers higher computational efficiency and avoids precision loss due to the approximate sigmoid in quantized modes.
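For reference, a minimal PyTorch sketch of the hard-sigmoid/hard-swish activations in Eqs. (2)–(3) and of the SE recalibration step used in the bottleneck is given below; the reduction ratio of 4 is an assumption, and the module name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hard_sigmoid(x):
    # Eq. (2): piecewise-linear approximation of the sigmoid, ReLU6(x + 3) / 6
    return F.relu6(x + 3.0) / 6.0

def hard_swish(x):
    # Eq. (3): x * hard_sigmoid(x), the "hard" version of swish
    return x * hard_sigmoid(x)

class SqueezeExcite(nn.Module):
    """Channel recalibration as used inside the MobileNetV3 bottleneck (sketch)."""
    def __init__(self, channels, reduction=4):          # reduction ratio is an assumption
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                           # global average pooling
        w = F.relu(self.fc1(w))
        w = hard_sigmoid(self.fc2(w))                    # channel-wise weights in [0, 1]
        return x * w[:, :, None, None]                   # recalibrate the feature map
```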

Neck network optimization

The original neck structure of YOLOv8 adopts multi-scale feature fusion, integrating features from different levels through progressive upsampling and downsampling operations. However, this cascaded multi-scale fusion process can lead to the loss of fine-grained details in lower-level features after multiple downsampling steps, with small object features especially prone to being overwhelmed by background information. Moreover, in the neck module of YOLOv8, the number of channels doubles after each Conv module, increasing the model’s parameter count and computational complexity. To address these issues, we incorporated the core concepts of the CNN-based Cross-scale Feature Fusion (CCFF) algorithm27 from RT-DETR into the neck structure of YOLOv8, as shown in Fig. 9.

Fig. 9. Improved YOLOv8 neck with CCFF.

The CCFF algorithm is built upon an optimized cross-scale fusion module that embeds multiple fusion blocks composed of 1×1 convolutions along the fusion path. These fusion blocks are designed to combine features from two adjacent scales into a new feature representation, using paired 1×1 convolutions to align channel dimensions and ensure smooth integration. Inspired by this design, we reconstructed the feature fusion pathway in our model. First, the P3, P4, and P5 outputs from the backbone are individually passed through 1×1 convolutions to unify their channel dimensions, thereby reducing the computational overhead of subsequent fusion operations. We then constructed a bidirectional path structure. In the upsampling path, high-level semantic features are fused with low-level detail features using a Conv + Upsample + Concat scheme. In the downsampling path, Conv and Concat operations form semantic enhancement feedback, ensuring that deep semantic information effectively guides shallow feature refinement. Compared to the original YOLOv8 approach, our improved model maintains the number of channels at 256 after the C2f module, whereas YOLOv8 doubles the channels at this stage. This refinement optimizes the computational pipeline, reduces complexity, and enhances small-object detection accuracy while preserving the lightweight nature of the model.
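To make the fusion pattern concrete, the sketch below shows one CCFF-style fusion block in PyTorch: 1×1 convolutions align the channel dimensions of two adjacent scales, the deeper map is upsampled, and the concatenated result is merged while keeping 256 output channels. The module name, kernel choices, and channel counts are illustrative assumptions, not the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse a higher-level (smaller) map with a lower-level (larger) map, CCFF-style (sketch)."""
    def __init__(self, c_high, c_low, c_out=256):
        super().__init__()
        self.align_high = nn.Conv2d(c_high, c_out, kernel_size=1)   # 1x1: unify channels
        self.align_low = nn.Conv2d(c_low, c_out, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.merge = nn.Conv2d(2 * c_out, c_out, kernel_size=3, padding=1)

    def forward(self, f_high, f_low):
        f_high = self.up(self.align_high(f_high))       # bring deep semantics to the finer resolution
        f_low = self.align_low(f_low)
        fused = torch.cat([f_high, f_low], dim=1)       # Concat along the channel axis
        return self.merge(fused)                        # keep the output at 256 channels

# Example: fuse P5 (512 ch, 20x20) into P4 (256 ch, 40x40)
p5, p4 = torch.randn(1, 512, 20, 20), torch.randn(1, 256, 40, 40)
print(FusionBlock(512, 256)(p5, p4).shape)              # -> torch.Size([1, 256, 40, 40])
```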

Deformable large kernel attention

The original neck structure of YOLOv8 primarily uses local convolutions for feature fusion, which are limited by local receptive fields as well as fixed sampling grids and convolution kernel sizes. This setup struggles to capture the long-range global context within images and fails to adaptively adjust for object scale variations. As a result, it lacks sufficient long-range dependency modeling, particularly when processing small targets in complex backgrounds. To address this, we have integrated the Deformable Large Kernel Attention mechanism28 into YOLOv8’s neck structure, as illustrated in Fig. 10.

Fig. 10. The position of D-LKA in the neck network of YOLOv8.

In traditional self-attention mechanisms29, relationships are calculated between each position and all others to capture global contextual information, which often results in significant computational and parameter overhead. The Large Kernel Attention mechanism employs large convolutional kernels to mimic the effect of a global receptive field. It reduces computational complexity by decomposing the convolution process as follows:

  • A depth-wise convolution is used for extracting local features and maintaining channel independence, which significantly lowers computational complexity by processing each input channel separately.

  • A depth-wise dilated convolution expands the receptive field without increasing the parameter count, effectively enabling the model to capture a broader range of contextual information.

  • A 1×1 convolution integrates multi-channel information and reduces dimensionality, facilitating efficient compression and expansion of features across the network layers.

These three operations approximate the effect of self-attention with significantly lower computation. The specific operations are as follows: Let the input feature map have dimensions H × W × C, and define a large kernel of size K. First, the depth-wise convolution (DW) is employed to capture local features, with its kernel size given by:

kernel size of DW = (2d − 1) × (2d − 1)  (4)

where d refers to the dilation rate, which controls the coverage of the local area, effectively capturing detailed information from neighboring pixels. Subsequently, to further expand the receptive field and control the number of parameters, the depth-wise dilated convolution (DW-D) is utilized. The size of its kernel is jointly determined by the target large kernel size K and the dilation rate d, with the formula as follows:

kernel size of DW-D = ⌈K/d⌉ × ⌈K/d⌉, applied with dilation rate d  (5)

Finally, by using a 1×1 convolution to adjust channel dimensions and aggregate multi-scale features, the feature representation capability is further enhanced. The cascading of these three components models the global context effectively, avoiding the parameter explosion associated with directly using large convolutional kernels. The parameters and FLOPs for the constructed large kernel structure are calculated by:

Params(K, d) = C(⌈K/d⌉² + (2d − 1)² + C)  (6)

FLOPs(K, d) = C(⌈K/d⌉² + (2d − 1)² + C) × H × W  (7)

The FLOPs increase linearly with the spatial size of the input feature map, while the number of parameters grows with the square of the number of channels and of the kernel sizes. However, since both values are typically small, they do not impose limitations on the algorithm’s efficiency and practicality.

To minimize the number of parameters for a fixed kernel size K, the derivative of Eq. (6) with respect to the dilation rate d can be taken and set to zero (approximating ⌈K/d⌉ ≈ K/d), thereby determining the optimal dilation rate:

−2K²/d³ + 4(2d − 1) = 0  (8)
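As a quick numerical check of Eqs. (6)–(8), the short script below evaluates the parameter and FLOP counts of the decomposed large kernel and scans the dilation rate for the parameter-minimizing value; bias terms are omitted, and the chosen K and C are illustrative assumptions.

```python
import math

def lka_params(K, d, C):
    """Eq. (6): parameters of the decomposed large-kernel structure (bias terms omitted)."""
    kd = math.ceil(K / d)                       # kernel size of the depth-wise dilated conv
    return C * (kd ** 2 + (2 * d - 1) ** 2 + C)

def lka_flops(K, d, C, H, W):
    """Eq. (7): FLOPs grow linearly with the spatial size H x W."""
    return lka_params(K, d, C) * H * W

K, C = 21, 256                                  # illustrative large-kernel size and channel count
best_d = min(range(1, K + 1), key=lambda d: lka_params(K, d, C))
print(best_d)                                   # -> 3, the integer d nearest the optimum implied by Eq. (8)
print(lka_params(K, best_d, C), lka_flops(K, best_d, C, 40, 40))
```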

Building on the foundation of large kernel attention, deformable large kernel attention further optimizes feature extraction through dynamic offset learning. Deformable Convolution30 is a module that enhances traditional convolution operations, designed to improve the modeling capabilities of convolutional neural networks towards geometric transformations by dynamically adjusting the sampling positions. Traditional convolutions with fixed sampling grids struggle to adapt to complex scenarios such as object deformation and scale changes. In contrast, deformable convolution learns adaptive offsets, allowing the model to focus on local key features of the target, thereby enhancing detection accuracy and classification robustness for small objects in complex backgrounds. This convolution allows the network to adaptively adjust the sampling regions based on the input content, effectively capturing the non-rigid structure of objects. Experiments show that deformable convolution significantly improves performance in small-target instrument detection, while dynamically expanding the receptive field enhances the recognition capabilities for multi-scale objects, breaking through the geometric limitations of traditional convolution with lower computational costs, and providing more flexible feature expression for detecting small targets in complex backgrounds. The architecture of the deformable large kernel attention module is illustrated in Fig. 11, with its core expression as follows:

Attention = Conv1×1(DDW-D-Conv(DDW-Conv(F)))  (9)

Output = Attention ⊗ F  (10)

where F denotes the input feature, DDW-Conv and DDW-D-Conv denote the deformable depth-wise convolution and the deformable depth-wise dilated convolution, respectively, and ⊗ denotes element-wise multiplication of the attention map with the input feature.
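A minimal PyTorch sketch of this attention block is given below, using torchvision’s DeformConv2d in place of the authors’ exact deformable convolution; the 5×5 depth-wise kernel, 7×7 dilated kernel, and dilation rate of 3 follow the K = 21, d = 3 decomposition of Eqs. (4)–(5) but are assumptions here.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLKA(nn.Module):
    """Deformable large-kernel attention sketch: deformable DW conv -> deformable DW dilated conv
    -> 1x1 conv, then element-wise gating of the input (Eqs. 9-10)."""
    def __init__(self, channels, k_dw=5, k_dwd=7, dilation=3):
        super().__init__()
        # small convs predict one (dx, dy) offset per kernel tap at every spatial position
        self.offset_dw = nn.Conv2d(channels, 2 * k_dw * k_dw, 3, padding=1)
        self.dw = DeformConv2d(channels, channels, k_dw,
                               padding=k_dw // 2, groups=channels)
        self.offset_dwd = nn.Conv2d(channels, 2 * k_dwd * k_dwd, 3, padding=1)
        self.dwd = DeformConv2d(channels, channels, k_dwd,
                                padding=dilation * (k_dwd // 2),
                                dilation=dilation, groups=channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)   # 1x1 channel mixing

    def forward(self, x):
        attn = self.dw(x, self.offset_dw(x))          # local features with an adaptive sampling grid
        attn = self.dwd(attn, self.offset_dwd(attn))  # enlarged receptive field
        attn = self.pw(attn)                          # Eq. (9): attention map
        return attn * x                               # Eq. (10): gate the input feature

x = torch.randn(1, 256, 40, 40)
print(DeformableLKA(256)(x).shape)                    # -> torch.Size([1, 256, 40, 40])
```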

Fig. 11. Architecture of the deformable LKA module.

Incorporating the Deformable Large Kernel Attention (D-LKA) mechanism and the CNN-based Cross-scale Feature Fusion (CCFF) algorithm into the neck of YOLOv8 enhances the model’s ability to detect small objects while maintaining its lightweight design. D-LKA adjusts the receptive field, enabling the network to more effectively focus on features of various scales within the image, as shown in Fig. 12. After CCFF effectively fuses features from various scales via its convolutional layers, D-LKA is able to conduct a more detailed analysis and optimization of these integrated features. By interspersing D-LKA after the C2f layers, the model utilizes the structured multi-scale features provided by CCFF to further enhance its ability to focus on and process crucial features. Therefore, the collaborative interaction between D-LKA and CCFF not only boosts the efficiency of feature utilization but also enhances the model’s performance in complex environments, showcasing a high degree of compatibility between the two.

Fig. 12. Comparison of heatmaps before and after adding D-LKA.

An efficient intersection over union

Given a predicted box B and a ground-truth box B^gt, YOLOv8 originally employed the CIOU loss function31, expressed as follows:

L_CIOU = 1 − IoU + ρ²(b, b^gt) / c² + αV  (11)

V = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²  (12)

where b and b^gt are the centers of the predicted and ground-truth boxes, ρ(·) is the Euclidean distance between them, c is the diagonal length of the smallest box enclosing both, and α is a positive trade-off weight; w^gt and h^gt are the width and height of the ground-truth bounding box, and w and h are those of the predicted box. V measures the aspect-ratio consistency between the two boxes. Although the CIOU loss considers the overlap area, the center-point distance, and the aspect ratios, Eq. (12) reveals that it only reflects the difference in aspect ratio rather than the actual relationships between (w, h) and (w^gt, h^gt). The ambiguity of V can cause CIOU to optimize similarity in an unreasonable way and may also slow its convergence. Consequently, we have replaced the original CIOU loss in YOLOv8 with the Efficient Intersection over Union (EIOU)32, defined as:

L_EIOU = 1 − IoU + ρ²(b, b^gt) / c² + ρ²(w, w^gt) / C_w² + ρ²(h, h^gt) / C_h²  (13)

Here, w and h are the width and height of the predicted box, while w^gt and h^gt are those of the ground-truth box; C_w and C_h represent the width and height of the smallest enclosing rectangle containing both boxes. EIOU abandons the indirect measurement of width and height through the aspect ratio and directly penalizes deviations in these dimensions, regardless of aspect-ratio consistency. This ensures direct constraints on absolute width and height deviations. Moreover, EIOU decouples the loss calculations for width and height, allowing gradients to act independently within their respective dimensions, thereby enhancing the rationality and efficiency of optimization. Lastly, by introducing C_w² and C_h² as normalization factors, it ensures that width and height differences are measured proportionally, improving the loss function’s adaptability to various box sizes. These refinements resolve the ambiguity of the original V, significantly speed up model convergence, and improve localization accuracy. Accurate bounding box localization is crucial in instrument panel detection tasks, where even minor size errors can substantially impair the accuracy of subsequent corrections. By meticulously controlling each bounding box’s dimensions, EIOU ensures the predicted boxes align as closely as possible with the instrument panel edges, which is vital for subsequent image processing.
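For illustration, a minimal PyTorch sketch of the EIOU computation in Eq. (13) for axis-aligned (x1, y1, x2, y2) boxes is shown below; the full YOLOv8 training pipeline additionally weights and aggregates this term, which is omitted here.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """Eq. (13): EIOU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4). Sketch only."""
    # intersection over union
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # smallest enclosing box: widths/heights C_w, C_h and squared diagonal c^2
    cx1 = torch.min(pred[:, 0], target[:, 0]); cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2]); cy2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = cx2 - cx1, cy2 - cy1
    c2 = cw ** 2 + ch ** 2 + eps

    # centre-distance term
    px = (pred[:, 0] + pred[:, 2]) / 2; py = (pred[:, 1] + pred[:, 3]) / 2
    tx = (target[:, 0] + target[:, 2]) / 2; ty = (target[:, 1] + target[:, 3]) / 2
    rho2 = (px - tx) ** 2 + (py - ty) ** 2

    # decoupled width / height terms normalised by the enclosing box
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    return 1 - iou + rho2 / c2 + (w - wg) ** 2 / (cw ** 2 + eps) + (h - hg) ** 2 / (ch ** 2 + eps)
```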

Results

Experimental environment and parameter

To ensure that the model is optimized effectively, the specific hardware and software settings used in this experiment are outlined in Table 2. These configurations are crucial for supporting deep learning tasks.

Table 2.

Experimental Environment.

Configuration | Parameter
CPU | Intel Core i5-14600KF
GPU | NVIDIA GeForce RTX 4060 Ti 16 GB
Operating system | Windows 10
Python | 3.9.20
PyTorch | 2.0.0
CUDA | 11.8

Additionally, to ensure optimal training outcomes, the following hyperparameters were established, as detailed in Table 3.

Table 3.

Hyperparameters configuration.

Hyperparameter | Value
Initial learning rate | 0.01
Final learning rate factor | 0.01
Momentum | 0.937
Weight decay coefficient | 0.0005
Input image size | 640
Epochs | 150
Batch size | 4

Performance evaluation metrics for object detection

To select the optimal model, we utilized classic deep learning object detection metrics such as Precision, Recall, mAP50, and mAP50-95. Beyond these performance indicators, lightweight models deployed on resource-constrained edge devices also necessitate consideration of computational efficiency and size. Consequently, we included FLOPs and Model Size as indicators of resource consumption. Here is a detailed introduction to these evaluation metrics:

Precision measures the proportion of identified positives that are correctly classified as such. Recall indicates the proportion of actual positives that were accurately detected. These metrics are calculated using True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN), with their respective formulas as follows:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)  (14)

High Precision indicates a lower rate of false positives, which is particularly important in complex backgrounds. Average Precision (AP) represents the average of Precision scores computed at various recall thresholds. The formula is as follows:

AP = ∫₀¹ P(R) dR  (15)

mAP is the mean of multiple AP values, calculated by averaging the AP values across all categories:

mAP = (1/N) · Σ_{i=1}^{N} AP_i  (16)

mAP50 is calculated as the Mean Average Precision at an IoU threshold of 0.5. mAP50-95 is a more rigorous metric that evaluates the model’s performance across multiple IoU thresholds ranging from 0.50 to 0.95.
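The sketch below illustrates Eqs. (14)–(16) numerically: precision and recall from detection counts, and an all-point-interpolated AP over a toy precision–recall curve. The counts and curve values are made-up examples; mAP (Eq. 16) is then simply the mean of the per-class AP values.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eq. (14): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """Eq. (15): area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]            # make precision monotonically decreasing
    return np.sum((r[1:] - r[:-1]) * p[1:])             # integrate P(R) dR

print(precision_recall(tp=92, fp=8, fn=10))             # -> (0.92, ~0.902)
print(average_precision(np.array([0.2, 0.6, 1.0]),      # toy curve -> 0.84
                        np.array([1.0, 0.9, 0.7])))
```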

FLOPs represent the number of floating-point operations required for one forward pass, serving as an important indicator of model complexity. Generally, a lower FLOPs count means the model runs faster. Weight Size refers to the total storage requirement for all the parameters within the model. The model’s size directly affects its ease of storage and deployment, especially on devices with limited memory, where smaller models are preferred.

The overall performance of YOLOv8-MCDE

To verify the effectiveness of the YOLOv8-MCDE algorithm proposed in this paper for detecting small targets on instrument panels against complex backgrounds, and to assess its feasibility for edge deployment, we conducted a comparative analysis with the original YOLOv8s model. Detailed results are shown in Table 4.

Table 4.

Comparative Results of Evaluation Metrics.

Network model | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
YOLOv8s | 90.42 | 93.08 | 90.09 | 76.80 | 28.4 | 22.5
YOLOv8-MCDE | 92.80 | 90.62 | 91.36 | 76.22 | 17.7 | 14.4

The results presented in the table above demonstrate that YOLOv8-MCDE not only significantly reduces FLOPs and Weight Size but also enhances Precision and mAP50 compared to YOLOv8s. Specifically, Precision for YOLOv8-MCDE increased from 90.42% to 92.80%, marking an improvement of 2.38%. This significant reduction in the false positive rate is particularly crucial for identifying small targets against complex backgrounds. In practical scenarios with inspection robots, this increase in Precision can effectively minimize the uploading of unnecessary photos, thereby reducing the burden on resource-limited edge devices. While the model’s Recall slightly decreased, potentially leading to some missed detections, the targets typically missed are those with inherently poor image quality, which are unlikely to be correctly recognized later. Hence, this minor decline in Recall has a limited impact in real-world applications, whereas the improvement in Precision is more significant. Additionally, mAP50 improved from 90.09% to 91.36%, an increase of 1.27%, which enhances the model’s average detection accuracy for small targets. In terms of computational efficiency, FLOPs were reduced from 28.4G to 17.7G, a decrease of 37.68%, and the Model Size was also reduced from 22.5MB to 14.4MB, down by 36%. Compared to YOLOv8s, the YOLOv8-MCDE model has achieved significant optimizations in both computational resources and storage requirements while also showing improvements in detection performance. Figure 13 displays some of the detection results, while Fig. 14 illustrates the enhancements, with the left side showing conditions before the improvements and the right side after.

Fig. 13. Partial detection results.

Fig. 14. Visual comparison before and after model improvements.

Ablation experiment

To further validate the effectiveness of the YOLOv8-MCDE model for detecting small targets on instrument panels in complex environments, and to assess its enhancements over the original algorithm, this study carried out the following ablation experiments.

Comparative experiments on backbone networks

As indicated in Table 5, replacing YOLOv8’s backbone network with various lightweight networks resulted in reduced FLOPs and Weight Size, but generally led to a decrease in mAP as well. Notably, MobileNetV3 and ResNet101 showed higher levels of average precision. However, in comparison to ResNet101, MobileNetV3 not only improved mAP50 but also decreased FLOPs by 27.11%, whereas ResNet101 achieved only a 3.52% reduction in FLOPs. These findings demonstrate that MobileNetV3 offers a significant advantage in lightweight design while maintaining strong detection capabilities for small targets. Therefore, we have selected MobileNetV3 as the backbone network for YOLOv8-MCDE.

Table 5.

Comparison of different backbone network.

Network model | Backbone network | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
YOLOv8s | CSPDarknet-53 | 90.42 | 93.08 | 90.09 | 76.80 | 28.4 | 22.5
YOLOv8s | MobileNetV1 | 87.85 | 91.28 | 86.98 | 74.14 | 27.2 | 22.2
YOLOv8s | MobileNetV2 | 91.62 | 87.22 | 89.08 | 75.07 | 21.0 | 16.8
YOLOv8s | MobileNetV3 | 92.17 | 92.81 | 90.23 | 76.64 | 20.7 | 21.2
YOLOv8s | MobileNetV4 | 88.36 | 77.22 | 80.82 | 69.58 | 18.9 | 17.6
YOLOv8s | ShuffleNetV2 | 89.02 | 89.63 | 90.87 | 72.31 | 18.5 | 15.0
YOLOv8s | GhostNetV1 | 87.77 | 82.99 | 81.31 | 70.14 | 19.4 | 20.7
YOLOv8s | GhostNetV2 | 88.02 | 86.59 | 83.87 | 71.01 | 19.9 | 22.8
YOLOv8s | ResNet18 | 88.44 | 87.15 | 86.78 | 75.19 | 20.8 | 15.3
YOLOv8s | ResNet101 | 90.14 | 88.83 | 90.84 | 76.56 | 27.4 | 19.5

Comparative experiments on backbone and neck network combinations

With MobileNetV3 confirmed as the optimal backbone network, different neck structures were selected for experimentation. The experimental results are shown in Table 6.

Table 6.

Comparison of network necks.

Backbone network | Neck | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
MobileNetV3 | CCFF | 93.00 | 90.47 | 90.63 | 77.32 | 15.2 | 13.5
MobileNetV3 | RepGFPN | 89.91 | 91.13 | 87.69 | 75.59 | 21.2 | 22.6
MobileNetV3 | BiFPN | 92.24 | 86.26 | 90.56 | 77.05 | 20.3 | 19.4
MobileNetV3 | Slim-Neck | 91.82 | 84.19 | 87.47 | 74.02 | 17.4 | 19.5
MobileNetV3 | ASF-YOLO | 92.63 | 88.76 | 89.84 | 76.82 | 21.8 | 21.6
MobileNetV3 | HS-FPN | 94.17 | 84.77 | 88.81 | 74.76 | 16.3 | 13.4

As indicated in the table above, the CCFF neck structure performs exceptionally well, achieving a Precision of 93.00%, which represents an increase of 2.58%. It also scored the highest in mAP50-95 at 77.32%, while reducing FLOPs to 15.2G, a decrease of 26.57%. This demonstrates a successful balance between performance and computational cost. The BiFPN neck structure closely follows in mAP50-95 performance, though it requires 5.1G more in computational resources compared to CCFF. Other structures such as Slim-Neck, ASF-YOLO, and RepGFPN exhibit moderate performance across the metrics and do not compare favorably with CCFF. Therefore, combining the MobileNetV3 backbone with the CCFF neck structure emerges as the optimal choice to meet the dual demands of lightness and high precision.

Comparative experiments on backbone, neck, and attention mechanism combinations

In Table 7, we explored the optimal combinations using MobileNetV3 as the backbone and CCFF as the neck network, with the addition of various attention mechanisms. These mechanisms are integrated into the neck structure to enhance the model’s ability to process key image features. By emphasizing important information and minimizing the impact of background noise, these mechanisms increase the model’s precision in detecting targets of varying sizes and complexities, and reduce the potential for false positives. This setup not only optimizes feature representation but also boosts the model’s efficiency in resource-constrained environments, ensuring that the model maintains high detection accuracy and overall performance while remaining lightweight.

Table 7.

Comparison of different attention mechanisms.

Model | Backbone | Neck | Attention | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
Model 1 | MobileNetV3 | CCFF | – | 93.00 | 90.47 | 90.63 | 77.32 | 15.2 | 13.5
Model 2 | MobileNetV3 | CCFF | DAT | 89.58 | 90.24 | 88.43 | 74.79 | 15.3 | 13.7
Model 3 | MobileNetV3 | CCFF | SEAM | 92.83 | 88.27 | 89.90 | 75.48 | 15.6 | 13.7
Model 4 | MobileNetV3 | CCFF | ACmix | 92.41 | 89.06 | 88.37 | 74.85 | 16.1 | 13.6
Model 5 | MobileNetV3 | CCFF | LSKAttention | 92.90 | 92.80 | 90.40 | 75.13 | 15.3 | 13.6
Model 6 | MobileNetV3 | CCFF | iBMB | 90.30 | 88.76 | 89.20 | 77.73 | 15.6 | 13.7
Model 7 | MobileNetV3 | CCFF | RCS-OSA | 94.39 | 82.30 | 89.32 | 75.15 | 29.0 | 19.0
Model 8 | MobileNetV3 | CCFF | D-LKA | 90.67 | 88.82 | 91.24 | 77.91 | 17.7 | 14.4
Model 9 | MobileNetV3 | CCFF | TripletAttention | 90.36 | 88.76 | 90.34 | 76.65 | 15.2 | 13.5
Model 10 | MobileNetV3 | CCFF | MSDA | 91.39 | 90.54 | 89.05 | 76.70 | 15.3 | 13.6
Model 11 | MobileNetV3 | CCFF | FocusedLinearAttention | 88.51 | 88.94 | 87.27 | 74.03 | 15.3 | 13.6

The data from the table reveal that different attention mechanisms have a significant impact on both accuracy and model complexity. Some attention modules enhance model performance, though not always markedly, and some can even degrade performance. Notably, RCS-OSA and D-LKA stand out. RCS-OSA boosts Precision to 94.39% but significantly lowers Recall to 82.30%, with FLOPs increasing dramatically to 29.0G, nearly double that of other configurations. On the other hand, D-LKA increases mAP50 and mAP50-95 to 91.24% and 77.91%, respectively, the highest among all attention mechanisms tested. This indicates that introducing D-LKA enhances the model’s average precision across all IoU thresholds, significantly improving overall detection accuracy. Importantly, D-LKA keeps the computational cost at 17.7G FLOPs, only 2.5G more than the configuration without attention, which remains substantially lower than more complex attention mechanisms such as RCS-OSA, striking a reasonable balance between efficiency and accuracy.

To further explore the impact of the insertion point of D-LKA within the YOLOv8 neck structure on model performance, this study conducted several comparative experiments, as detailed in Table 8. These experiments involved placing D-LKA at different points within the neck network and assessing various performance metrics to determine the optimal placement for D-LKA. The various insertion points for D-LKA are depicted in Fig. 15.

Table 8.

Positional Comparison of D-LKA within the Neck Network.

Model | Backbone | Neck | Attention | Position | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
Model 1 | MobileNetV3 | CCFF | D-LKA | Position 1 | 89.63 | 88.76 | 88.76 | 77.07 | 17.1 | 13.8
Model 2 | MobileNetV3 | CCFF | D-LKA | Position 2 | 90.29 | 87.13 | 90.05 | 76.56 | 15.7 | 13.8
Model 3 | MobileNetV3 | CCFF | D-LKA | Position 3 | 90.67 | 88.82 | 91.24 | 77.91 | 17.7 | 14.4
Fig. 15. Different insertion points of D-LKA.

As shown in Table 8, Position 3 demonstrates the best overall performance and was therefore selected for subsequent experiments and deployment in this study. Specifically, it achieves the highest mAP50 and mAP50-95 scores among all tested positions, reaching 91.24% and 77.91%, respectively. These results indicate that Position 3 offers excellent overall performance in small object detection, with particularly strong results under the more rigorous evaluation metric of mAP50-95.

Comparative experiments on loss functions

To assess the impact of different loss functions on detection accuracy, several alternatives were compared against the original loss function used in YOLOv8. The detailed results are shown in Table 9.

Table 9.

Comparison of loss functions.

Model | Backbone | Neck | Attention | Loss | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
Model 1 | MobileNetV3 | CCFF | D-LKA | CIOU | 90.67 | 88.82 | 91.24 | 77.91 | 17.7 | 14.4
Model 2 | MobileNetV3 | CCFF | D-LKA | WIOU | 75.25 | 90.82 | 84.23 | 50.19 | 17.7 | 14.4
Model 3 | MobileNetV3 | CCFF | D-LKA | GIOU | 86.30 | 89.32 | 90.28 | 76.84 | 17.7 | 14.4
Model 4 | MobileNetV3 | CCFF | D-LKA | CIOU | 88.14 | 89.06 | 87.82 | 75.77 | 17.7 | 14.4
Model 5 | MobileNetV3 | CCFF | D-LKA | EIOU | 92.80 | 90.62 | 91.36 | 76.22 | 17.7 | 14.4
Model 6 | MobileNetV3 | CCFF | D-LKA | DIOU | 89.37 | 88.51 | 87.74 | 75.22 | 17.7 | 14.4
Model 7 | MobileNetV3 | CCFF | D-LKA | inner_SIOU | 91.04 | 89.92 | 89.75 | 76.81 | 17.7 | 14.4
Model 8 | MobileNetV3 | CCFF | D-LKA | inner_WIOU | 79.76 | 91.17 | 84.12 | 55.87 | 17.7 | 14.4
Model 9 | MobileNetV3 | CCFF | D-LKA | inner_GIOU | 88.26 | 84.47 | 85.65 | 73.93 | 17.7 | 14.4
Model 10 | MobileNetV3 | CCFF | D-LKA | inner_DIOU | 92.19 | 89.88 | 90.62 | 76.03 | 17.7 | 14.4
Model 11 | MobileNetV3 | CCFF | D-LKA | inner_EIOU | 89.73 | 82.54 | 87.85 | 76.57 | 17.7 | 14.4
Model 12 | MobileNetV3 | CCFF | D-LKA | inner_CIOU | 89.72 | 88.27 | 86.58 | 75.05 | 17.7 | 14.4
Model 13 | MobileNetV3 | CCFF | D-LKA | ShapeIOU | 90.33 | 89.05 | 87.31 | 75.38 | 17.7 | 14.4
Model 14 | MobileNetV3 | CCFF | D-LKA | InnerShapeIOU | 91.82 | 89.06 | 90.86 | 75.72 | 17.7 | 14.4

Overall, different loss functions show varying performance in terms of Precision, Recall, and mAP. Most achieve a Precision above 88%, while some, such as WIOU and inner_WIOU, perform noticeably worse, with Precision dropping to roughly 75–80%. In contrast, EIOU delivers the best overall performance, achieving the highest Precision at 92.80%, an improvement of 2.13% over the original CIOU configuration. It also improves Recall by 1.80% and achieves the highest mAP50, an increase of 0.12%. Therefore, in this experiment, the EIOU loss function proves to be the most effective for small object detection in complex environments.

Ablation experiment results

Table 10 presents four key enhancements introduced in this study. First, by replacing the original backbone network with MobileNetV3, the model’s complexity and size were significantly reduced while increasing Precision by 1.75% and mAP50 by 0.14%. The integration of the CCFF algorithm into the neck network led to a 46.48% reduction in model complexity, and improved Precision by 0.83%, mAP50 by 0.13%, and mAP50-95 by 0.68%. Moreover, incorporating the D-LKA attention mechanism into the neck network increased mAP50 and mAP50-95 by 0.61% and 0.59%, respectively. Lastly, replacing the original CIOU loss function with the more efficient EIOU resulted in increases of 2.13% in Precision and 0.12% in mAP50. Collectively, these modifications significantly lowered the model’s complexity while markedly enhancing its accuracy for detecting small targets on instrument panels.

Table 10.

Results of the Ablation Experiments.

Network model | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G | Weight size/MB
YOLOv8s | 90.42 | 93.08 | 90.09 | 76.80 | 28.4 | 22.5
YOLOv8s+MobileNetV3 | 92.17 | 92.81 | 90.23 | 76.64 | 20.7 | 21.2
YOLOv8s+MobileNetV3+CCFF | 93.00 | 90.47 | 90.63 | 77.32 | 15.2 | 13.5
YOLOv8s+MobileNetV3+CCFF+D-LKA | 90.67 | 88.82 | 91.24 | 77.91 | 17.7 | 14.4
YOLOv8s+MobileNetV3+CCFF+D-LKA+EIOU | 92.80 | 90.62 | 91.36 | 76.22 | 17.7 | 14.4

Comparison of different algorithm types

To further demonstrate the advantages of the proposed YOLOv8-MCDE algorithm over existing mainstream object detection methods in practical industrial scenarios, we conducted comparative experiments under consistent conditions, including identical training settings and datasets. Specifically, we selected several representative detection algorithms introduced in recent years, including YOLOv5s, YOLOv8s, YOLOv10s, YOLOv11s, YOLOv12s, YOLOv13s, as well as the popular RT-DETR and the classical two-stage detector Faster R-CNN.

As shown in Table 11, YOLOv8-MCDE achieves superior performance across multiple key evaluation metrics, notably obtaining the best results in Precision and mAP50. It is particularly noteworthy that YOLOv8-MCDE also exhibits the lowest inference cost, with only 17.7 G, significantly outperforming other one-stage models and far exceeding RT-DETR (108.0 G) and Faster R-CNN (470.5 G) in terms of computational efficiency. These results highlight the strong lightweight design of the proposed approach without compromising detection accuracy, making it highly suitable for deployment on edge devices.

Table 11.

Comparative Experiments of the YOLO Series and Other Methods.

Model | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/% | FLOPs/G
YOLOv5s | 88.27 | 89.67 | 90.63 | 74.96 | 23.8
YOLOv8s | 90.42 | 93.08 | 90.09 | 76.80 | 28.4
YOLOv10s | 88.38 | 83.76 | 86.23 | 72.64 | 24.4
YOLOv11s | 92.80 | 89.42 | 90.79 | 76.88 | 21.3
YOLOv12s | 89.24 | 91.13 | 88.42 | 74.85 | 21.2
YOLOv13s | 88.70 | 90.77 | 90.40 | 76.20 | 21.0
RT-DETR | 47.82 | 52.25 | 46.66 | 36.21 | 108.0
Faster R-CNN | 62.94 | 86.69 | 84.88 | – | 470.5
YOLOv8-MCDE | 92.80 | 90.62 | 91.36 | 76.22 | 17.7

In conclusion, YOLOv8-MCDE achieves an impressive balance between detection accuracy and computational efficiency, demonstrating strong potential for practical application, especially in industrial scenarios where both precision and speed are critical for automated object detection tasks.

Discussion

The YOLOv8-MCDE model, despite significant lightweighting, shows improved detection performance for small instrument objects in complex backgrounds, marking substantial improvements over the original YOLOv8, already considered a robust framework. These enhancements are critical when the operational environment demands high accuracy with constrained computational resources.

To further assess the effectiveness and generalization capability of the proposed YOLOv8-MCDE model for instrument detection, we conducted comparative experiments on a public instrument dataset comprising 783 images of pointer meters. As shown in Table 12, the proposed YOLOv8-MCDE model achieves comparable or superior performance across several evaluation metrics when compared to the original YOLOv8s, demonstrating its robustness and adaptability in real-world industrial scenarios.

Table 12.

Performance Comparison of YOLOv8s and YOLOv8-MCDE on the Public Instrument Dataset.

Model | P/% | R/% | mAP@0.5/% | mAP@0.5–0.95/%
YOLOv8s | 99.86 | 100.00 | 99.50 | 98.67
YOLOv8-MCDE | 100.00 | 99.83 | 99.50 | 98.92

Specifically, YOLOv8-MCDE achieves a 0.14% improvement in Precision compared to the original YOLOv8s model, indicating a further reduction in the false positive rate. In practical inspection scenarios, false detections can mislead subsequent meter reading processes and result in unnecessary image uploads, which in turn increase the computational burden on edge devices with limited processing capabilities. Although Recall shows a slight decline, potentially leading to a small number of missed detections, these are typically associated with low-quality images or severely occluded meters that are difficult to recognize even in post-processing stages, so the practical impact of this minor decrease is limited. From a deployment standpoint, false positives are often more detrimental than false negatives because they introduce noise and redundancy into the overall pipeline; achieving higher Precision is therefore generally prioritized over maximizing Recall, especially in resource-constrained environments. This makes YOLOv8-MCDE particularly well suited for deployment on the edge devices used in inspection robots.

In terms of localization accuracy, YOLOv8-MCDE also achieves a 0.25% improvement in the stricter mAP50-95 metric, increasing from 98.67% to 98.92%. This demonstrates its enhanced bounding box regression performance and consistent accuracy across a range of IoU thresholds, particularly under complex backgrounds. These gains are attributed to the structural improvements introduced in YOLOv8-MCDE. More importantly, the model delivers not only strong results on the expanded Meter Challenge dataset but also stable and reliable performance on an independent public dataset, confirming its excellent generalization ability and practical viability for industrial deployment.

Author contributions

T.Y. conceived the overall research framework, designed the experiments, implemented the algorithm, and performed model training and evaluation. D.L. prepared and processed the dataset. Q.S. and T.J. supervised the project, revised the manuscript, and provided critical feedback. All authors reviewed and approved the final manuscript.

Data availability

The data used in this study are available from the corresponding authors upon reasonable request.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Qingwu Shi, Email: shiqingwu@jmsu.edu.cn.

Tianyue Jiang, Email: 13704503245@163.com.

References

  • 1. Salomon, G., Laroca, R. & Menotti, D. Deep learning for image-based automatic dial meter reading: Dataset and baselines. In 2020 International Joint Conference on Neural Networks (IJCNN), 1–8 (IEEE, 2020).
  • 2. Allan, J.-F. & Beaudry, J. Robotic systems applied to power substations - a state-of-the-art survey. In Proceedings of the 2014 3rd International Conference on Applied Robotics for the Power Industry, 1–6 (IEEE, 2014).
  • 3. Liu, C. & Wu, Y. Research progress of vision detection methods based on deep learning for transmission lines. Proc. CSEE 43, 7423–7446 (2023).
  • 4. Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas. 71, 1–13 (2021).
  • 5. Zheng, C., Wang, S., Zhang, Y., Zhang, P. & Zhao, Y. A robust and automatic recognition system of analog instruments in power system by using computer vision. Measurement 92, 413–420 (2016).
  • 6. Liu, Y., Liu, J. & Ke, Y. A detection and recognition system of pointer meters in substations based on computer vision. Measurement 152, 107333 (2020).
  • 7. Zuo, L., He, P., Zhang, C. & Zhang, Z. A robust approach to reading recognition of pointer meters based on improved Mask-RCNN. Neurocomputing 388, 90–101 (2020).
  • 8. Shen, J., Liu, N., Sun, H., Li, D. & Zhang, Y. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention fine-grained features. IEEE Trans. Instrum. Meas. 73, 1–16 (2024).
  • 9. Zhang, Q. et al. Water meter pointer reading recognition method based on target-key point detection. Flow Meas. Instrum. 81 (2021).
  • 10. Zhou, D., Yang, Y., Zhu, J. & Wang, K. Intelligent reading recognition method of a pointer meter based on deep learning in a real environment. Meas. Sci. Technol. 33 (2022).
  • 11. Xu, W., Wang, W., Ren, J., Cai, C. & Xue, Y. A novel object detection method of pointer meter based on improved YOLOv4-tiny. Appl. Sci. 13, 3822 (2023).
  • 12. Hou, L., Wang, S., Sun, X. & Mao, G. A pointer meter reading recognition method based on YOLOX and semantic segmentation technology. Measurement 218, 113241 (2023).
  • 13. Jin, F., Chu, Z., Zhu, M., Chen, B. & Dong, X. An adaptive robust pointer meter automatic reading algorithm based on CAR-UNet. IEEE Trans. Instrum. Meas. 74, 1–12 (2025).
  • 14. Shu, Y., Liu, S., Xu, H. & Jiang, F. Read pointer meters based on a human-like alignment and recognition algorithm. In CCF National Conference of Computer Applications, 162–178 (Springer, 2023).
  • 15. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
  • 16. Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7263–7271 (2017).
  • 17. Redmon, J. & Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
  • 18. Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
  • 19. Jocher, G. et al. ultralytics/yolov5: Initial release (2020).
  • 20. Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
  • 21. Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7464–7475 (2023).
  • 22. Varghese, R. & Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), 1–6 (IEEE, 2024).
  • 23. Sohan, M., Sai Ram, T. & Rami Reddy, C. V. A review on YOLOv8 and its advancements. In International Conference on Data Intelligence and Cognitive Informatics, 529–545 (Springer, 2024).
  • 24. Jiang, T. et al. YOLOv8-GO: A lightweight model for prompt detection of foliar maize diseases. Appl. Sci. 14, 10004 (2024).
  • 25. Howard, A. et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1314–1324 (2019).
  • 26. Ramachandran, P., Zoph, B. & Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017).
  • 27. Zhao, Y. et al. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16965–16974 (2024).
  • 28. Azad, R. et al. Beyond self-attention: Deformable large kernel attention for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1287–1297 (2024).
  • 29. Zhao, H., Jia, J. & Koltun, V. Exploring self-attention for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10076–10085 (2020).
  • 30. Dai, J. et al. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 764–773 (2017).
  • 31. Zheng, Z. et al. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 52, 8574–8586 (2021).
  • 32. Zhang, Y.-F. et al. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 506, 146–157 (2022).


