Abstract
To address the challenges of low localization accuracy, weak feature discrimination, and real-time constraints in small-scale metallic surface defect detection, this study proposes a novel detection framework named FF-MDE (fine-grained multi-branch diversified encoder). The proposed model enhances detection capability through three key modules: a direction-aware multi-branch convolutional block (BBDES) that strengthens orientation-sensitive feature extraction via re-parameterization; a multi-scale fusion network (MFFN) incorporating a location-offset heterogeneous kernel selection strategy and a channel cross-embedding mechanism to align cross-resolution features and adaptively expand receptive fields; and a lightweight attention-guided up-sampling strategy that improves fine-detail recovery while suppressing irrelevant responses. Extensive experiments on the NEU-DET and GC10-DET datasets show that FF-MDE achieves excellent performance, with mAP50 scores of 70.3% and 57.9% respectively, exceeding the baseline model by 4.1 and 3.3 percentage points and existing methods by 7.6 and 6.5 percentage points on average. Moreover, the network's real-time inference speed exceeds 60 FPS, providing a powerful and deployable solution for high-precision defect detection in industrial vision inspection systems.
Keywords: Metal defects, Multi-branch architecture, Heterogeneous kernels, Gating mechanism, Fine-grained characterization
Subject terms: Engineering, Mathematics and computing
Introduction
In industrial sectors such as high-end manufacturing, transportation, and steel smelting, microscopic defects on metal surfaces—often imperceptible to the naked eye—can induce structural fatigue, crack propagation, and catastrophic failure under complex service conditions1,2. For instance, in components like high-speed train axles and aerospace structures, initially undetected minute cracks can evolve into fracture initiation points under alternating loads, directly jeopardizing equipment safety and human lives3. Therefore, developing high-precision and high-efficiency metal defect detection methods remains a critical challenge in intelligent manufacturing quality control. This challenge is amplified in real-world environments characterized by high imaging noise, small target dimensions, and strong process interference.
Driven by advancements in deep learning4,5, convolutional neural networks (CNNs) have become a mainstream approach for surface defect detection. Classic single-stage detection methods (e.g., SSD6) are widely adopted in industrial inspection for their end-to-end architecture and inference efficiency. In contrast, two-stage detection methods (e.g., RCNN7, Fast-RCNN8, Mask-RCNN9) excel in target localization accuracy. However, metal surface defects often exhibit irregular shapes, similar textures, and sparse distribution10. Traditional convolutional structures thus face significant bottlenecks in preserving small-scale information, modeling cross-scale semantic consistency, and achieving robustness. Particularly in high-resolution images, issues such as feature blurring and false detections frequently arise11. Against this backdrop, Transformer-based encoder-decoder architectures offer a novel approach for object detection, leveraging self-attention mechanisms to achieve robust global modeling capabilities while simplifying end-to-end structures12. Nevertheless, when applied to industrial defect detection, these methods face significant challenges: (1) Insufficient directional selectivity and structural awareness during feature extraction lead to an inadequate response to subtle textures13. (2) An inherent incompatibility between shallow and deep semantic levels—regarding spatial resolution and feature scale—severely limits cross-scale feature fusion effectiveness14. (3) The decoding stage’s up-sampling process lacks adaptive control, resulting in insufficient detail reconstruction and limiting the precise localization of small defects15. These challenges are particularly pronounced in real-world tasks involving strong noise backgrounds, multi-scale targets, and heterogeneous defect types.
To address these challenges, this paper focuses on three key dimensions: fine-grained feature modeling, cross-scale semantic alignment, and structural efficiency. We propose a novel multi-branch diversified encoder architecture (Fine-grained Multi-Branch Diversified Encoder, FF-MDE). Grounded in an analysis of target characteristics and modeling bottlenecks in industrial inspection, this framework constructs three pathways: direction-sensitive feature extraction, cross-scale fusion of heterogeneous kernels, and detail enhancement via gating mechanisms. It effectively reconciles the trade-offs between detection accuracy, feature representation, and deployment efficiency, a common limitation in existing methods. The main contributions are as follows:
At the feature extraction stage, we introduce a multi-branch direction-aware feature extraction mechanism. Multi-scale convolutions and reparameterized structures across distinct branches enhance the representation of feature edges and defect texture orientations. This addresses the insufficient response to microstructures in traditional backbone networks.
At the encoder stage, we construct a multi-scale collaborative modeling network (MFFN). This network integrates heterogeneous kernel selection strategies with positional perturbation mechanisms to achieve semantic consistency across scales. Additionally, we introduce a channel embedding fusion mechanism to reconstruct cross-resolution semantic relationships. This mechanism strengthens the encoder’s feature extraction capabilities for surface defects and preserves more fine-grained feature information.
In the feature interaction fusion stage, we design a lightweight gated sampling mechanism. This mechanism employs dynamic attention regulation to enhance responses in defect regions, thereby improving boundary restoration and detail recovery in complex backgrounds.
Discussion
Metal surface defect detection methodologies have evolved through three distinct phases. The first phase, relying on traditional algorithms with manually engineered features, encountered robustness limitations under complex operating conditions16. The second phase, driven by feature learning exemplified by Convolutional Neural Networks (CNNs), achieved a significant leap in accuracy and efficiency. However, their inherent local receptive fields constrained the ability to model global context and long-range dependencies. The third phase emerged with the Transformer architecture, which addresses the locality constraints inherent in convolution. This architecture leverages self-attention mechanisms to capture global dependencies, offering a novel, high-performance solution for handling irregularly shaped, large-scale industrial defects.
Target feature extraction algorithm
Efficient multi-scale feature fusion remains a core challenge in target feature extraction. Traditional detection methods predominantly employ simple feature addition or concatenation for aggregation. However, this approach is suboptimal, as it can lead to information loss or redundancy17 and introduce noise during the fusion of high- and low-dimensional features. Consequently, models often suffer from “local perception” limitations, leading to inaccurate target localization and poor edge segmentation due to insufficient fine-grained detail18. To overcome these limitations, researchers have proposed various optimization strategies. At the decoder architecture level, Xiao et al.19 and Hu et al.20 adopted dual-decoder or progressive decoder designs to reduce feature loss and suppress noise during decoding, refining features through gradual up-sampling. At the feature representation level, Huang et al.21 designed a compact and efficient Channel MLP (CM) module to improve channel information extraction. This module combines channel attention with an MLP, enhancing the model’s feature extraction capability while maintaining a comparable computational cost. Furthermore, for multimodal tasks such as RGB-D, Wang et al.22 highlighted the need for depth-assisted cross-modal fusion modules to address challenges such as low-quality depth maps. In summary, modern detection models are evolving from simple feature aggregation toward more sophisticated, multidimensional fusion strategies.
CNN-based object detection algorithm
Object detection methods based on Convolutional Neural Networks (CNNs) are widely adopted in industrial quality inspection due to their simple architecture and fast inference. However, in complex industrial scenarios, their ability to model local details and fuse semantic information is limited, restricting the accurate identification of minute defects. To address this limitation in local detail modeling, researchers have proposed multiple improvements. Lu et al.23 designed the WIoU loss function to enhance localization accuracy and introduced the C2f-DSC module to improve fine-grained feature modeling. Xie et al.24 developed a lightweight multi-scale feature extraction module and an efficient detection head to enhance feature representation. Liang et al.25 proposed HDFA-Net, which utilizes frequency decoupling to separately model low-frequency global context and high-frequency local details, significantly improving small target discrimination. Luo et al.26 addressed low-contrast and camouflaged defects by introducing dynamic texture enhancement and adaptive scale balancing modules, which enhanced modeling capabilities for blurred boundaries and weak feature regions. Concurrently, to address feature extraction and fusion efficiency, Zhang et al.27 proposed a dual-agent path learning mechanism based on reinforcement learning and knowledge graph reasoning to enhance semantic understanding. Hu et al.28 strengthened structural information extraction using self-supervised contrastive learning and a convolutional feature retention mechanism (SCRL-EMD), offering insights for data-scarce scenarios. Furthermore, Zhang et al.29 designed a self-attention mechanism to capture global dependencies among fault types, combined with a frequency-based FCA module to extract cross-channel local features. 
Rui et al.30 employed a shape intersection ratio-based anchor box clustering algorithm to improve regression efficiency and constructed Asymmetric Convolutional Networks (ACNNs) for multi-level feature extraction. Chen et al.31 designed parallel and serial convolutional modules to obtain richer feature representations, thereby enhancing the detection of minute defects. Although these improvements enhance specific model capabilities while maintaining inference speed, their comprehensive modeling of local details and deep semantic information fusion remains insufficient. This insufficiency is particularly evident in real industrial scenarios, which feature complex backgrounds and defects with vast variations in size, morphology, and contrast. Consequently, the accuracy gains for identifying minute defects remain limited. This limitation forms a critical bottleneck for applying such methods in high-precision industrial quality inspection.
Transformer-based object detection algorithm
The Transformer encoder architecture has driven the adoption of encoder-decoder object detection methods for surface defect detection32. These approaches leverage self-attention mechanisms for global modeling, yielding strong semantic representation and robustness. This is particularly effective for identifying minute objects against complex textured backgrounds. Lee et al.33 developed the DropKey mechanism, which regularizes attention weights by randomly discarding key values before the Softmax layer. This process forces the model to focus on broader global contexts. At the architectural level, Li et al.34 proposed the REDef-DETR model. It enhances the receptive field and accuracy using context expansion modules and group attention, though this approach incurs a significant computational cost. To address this, Mao et al.35 developed the lightweight LRT-DETR framework. It employs a MobileNetV3 backbone and optimizes feature extraction using DWConv and VoVGSCSP modules. This design effectively reduces computational costs while achieving efficient detection of surface defects on steel. However, this lightweight approach compromises the model’s perception of small objects. To overcome potential semantic information loss and communication deficiencies in Transformers for visual tasks, Lin et al.36 proposed a deformable attention mechanism and an encoding-decoding communication module. Hong et al.37 employed a capsule autoencoder to construct geometry-aware representations, which are crucial for handling defects with occlusions or missing features. Furthermore, to address limitations in handling blurred and occluded regions, Xie et al.38 introduced Chebyshev Graph Convolution (CGConv) and Higher-Order Graph Convolution (HOGConv) to account for multi-order neighbors. They also employed a Dynamic Adjacency Matrix (GDAMFormer) to represent dynamic relationships between components, which helps overcome traditional method constraints. 
Furthermore, Wu et al.39 enhanced detection speed and feature expressiveness for blade surface defects by incorporating a multi-scale feature pyramid and Inner-GIoU loss. However, the limited depth of their encoder and insufficient fine-grained modeling40 introduced new bottlenecks. Overall, encoder-based methods demonstrate significant advantages in modeling global dependencies and semantic consistency. Nevertheless, critical aspects such as directional awareness, scale consistency modeling, and detail recovery remain suboptimal. This limitation hinders their generalization and real-time deployment in complex industrial quality inspection scenarios.
Methodology
We select ResNet-1841 as the backbone network. Compared with deeper variants such as ResNet-34/50, its shallower architecture mitigates the gradient dispersion and feature degradation issues associated with excessive depth, while maintaining structural simplicity and significantly enhancing feature representation in intermediate and shallow layers. The overall structure of FF-MDE is depicted in (Fig. 1).
Fig. 1.
FF-MDE model structure diagram.
Diverse direction-sensitive feature extraction strategies
Although the BasicBlock in ResNet mitigates gradient vanishing through residual connections, its core module employs a single-branch 3 × 3 convolutional stack. This design limits expressiveness in multi-scale and orientation-aware modeling, particularly for defects with irregular textures and pronounced orientations (e.g., scratches, cracks). Consequently, its feature extraction capability struggles to satisfy fine-grained detection requirements. To address this, we integrate an enhanced module with multi-branch reparameterization42 into the shallow backbone, constructing the BBDES. This module incorporates heterogeneous convolutional branches, asymmetric directional sensing paths, and multiscale context fusion to significantly enhance perceptual resolution and semantic representation. Its structure is depicted in (Fig. 2).
Fig. 2.
BBDES model structure diagram.
During training, BBDES incorporates the following branch structure: Local-aware branch: uses 1 × 1 convolution to extract compact local features, enhancing fine-texture modeling. Scale-aware branch: employs serial 1 × 1 and 3 × 3 convolutions to expand the receptive field and improve scale adaptability. Orientation-aware branch: introduces asymmetric convolutions (1 × 3 and 3 × 1) to strengthen directional sensitivity along horizontal and vertical axes. Contextual modeling branch: aggregates contextual information via average pooling to capture global features. Identity mapping path: preserves original features through skip connections, improving network stability and gradient flow. Collectively, these branches target distinct perceptual dimensions, forming a multi-perspective, multi-scale fine-grained characterization system. During training, the overall computation simplifies to a weighted summation of multi-branch convolutions, with the computational formula expressed as:
$$F(x)=\sum_{i=1}^{N}\left(W_i * x + b_i\right) \tag{1}$$

where $W_i$ and $b_i$ denote the equivalent kernel and bias of the $i$-th branch.
During inference, the branch parameters are fused into an equivalent single 3 × 3 convolutional kernel. This transformation preserves the training-phase diversity modeling capabilities while maintaining computational efficiency. Furthermore, BBDES integrates seamlessly into ResNet’s residual paths, where its outputs are summed with identity mappings. This enhances directional sensitivity and feature discriminability while stabilizing deep gradient propagation, thereby mitigating network degradation.
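A minimal sketch of why this fusion is lossless (our own toy illustration, not the paper's code): convolution is linear in its kernel, so summing branch outputs equals a single convolution with the element-wise sum of the zero-padded branch kernels. The image and kernel values below are made up.

```python
# Hypothetical sketch: re-parameterized branch fusion via kernel linearity.
def conv2d(img, k):
    """'Same'-padded 2D cross-correlation on nested lists."""
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    H, W = len(img), len(img[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < H and 0 <= x < W:
                        s += img[y][x] * k[a][b]
            out[i][j] = s
    return out

def pad_to_3x3(k):
    """Embed a 1x3 or 3x1 asymmetric kernel at the center of a 3x3 kernel."""
    out = [[0.0] * 3 for _ in range(3)]
    kh, kw = len(k), len(k[0])
    r0, c0 = (3 - kh) // 2, (3 - kw) // 2
    for a in range(kh):
        for b in range(kw):
            out[r0 + a][c0 + b] = k[a][b]
    return out

img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3x3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
k1x3 = pad_to_3x3([[1.0, 0.0, -1.0]])       # horizontal direction-aware branch
k3x1 = pad_to_3x3([[1.0], [0.0], [-1.0]])   # vertical direction-aware branch

# Training-time view: run the branches separately and add the results.
branch_sum = [[a + b + c for a, b, c in zip(r1, r2, r3)]
              for r1, r2, r3 in zip(conv2d(img, k3x3),
                                    conv2d(img, k1x3),
                                    conv2d(img, k3x1))]

# Inference-time view: fuse the kernels first, then convolve once.
fused = [[k3x3[a][b] + k1x3[a][b] + k3x1[a][b] for b in range(3)] for a in range(3)]
fused_out = conv2d(img, fused)
assert branch_sum == fused_out
```

The same argument covers the pooling and identity branches, since both are linear operators expressible as fixed 3 × 3 kernels.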
Linear optimization approach for encoders
Traditional self-attention in encoders requires computing pairwise token similarities, resulting in quadratic complexity and memory growth relative to token count. This complexity becomes particularly prohibitive for high-resolution images or long token sequences, restricting model scalability and deployment feasibility. To address these limitations, we propose AIFITSSA: a statistically-driven linear attention mechanism. Building on Token Statistics Self-Attention (TSSA)43 with variational optimization and coding rate compression, we design a linear-complexity encoder for structured feature learning. This dramatically reduces computation while preserving modeling capacity, applying Maximal Coding Rate Reduction (MCR²) to global and grouped coding rates, with the objective function:
$$\max_{Z,\Pi}\;\Delta R(Z,\Pi)=R(Z)-R_c(Z\mid\Pi) \tag{2}$$
In Eq. (2), $R(Z)$ represents the overall coding rate, measuring the distribution range of all features; maximizing $R(Z)$ promotes maximum feature dispersion. $R_c(Z\mid\Pi)$ denotes the grouped compression coding rate, quantifying feature compression within subgroups. $Z$ is the feature matrix, and $\Pi$ the group assignment matrix. Integrating this attention mechanism reconstructs the compression term $R_c(Z\mid\Pi)$ variationally, yielding a linear-complexity optimization model. Specifically, we reformulate $R_c(Z\mid\Pi)$ using an orthogonal projection matrix $U$, as expressed in:
$$R_c(Z\mid\Pi)\le\widetilde{R}_c(Z,U\mid\Pi)=\sum_{k}\frac{n_k}{2n}\sum_{j}f\!\left(u_j^{\top}Z_k Z_k^{\top}u_j\right) \tag{3}$$
In Eq. (3), $f(\cdot)$ denotes a concave function that transforms the spectral function of high-dimensional covariance matrices into low-dimensional projection statistics, suppressing low-energy components. Key parameters include: $\widetilde{R}_c$, the variational upper bound of $R_c$; $n_k$, the token count in group $k$; and $\Pi$, the grouping probability distribution over tokens. We compute gradients for variational target descent, thereby adjusting the global attention weights as defined by:
$$D_k=\operatorname{diag}\!\left(f'\!\left(\sum_{i}\pi_{k,i}\,(U^{\top}z_i)^{\odot 2}\right)\right) \tag{4}$$

$$z_i'=z_i+\sum_{k}\pi_{k,i}\,U D_k U^{\top}z_i \tag{5}$$
In Eq. (4), $D_k$ represents a dynamically adjusted weight matrix based on second-order moment statistics, incorporating a nonlinear activation component that suppresses low-magnitude statistical directions. Here, $\pi_{k,i}$ denotes a learnable assignment parameter. Using these statistics, projected features undergo scaling to enhance salient features while suppressing noise and irrelevant components. Processed features are back-projected into the original space and combined via skip connections to yield updated token features. The AIFITSSA structure is illustrated in (Fig. 3).
Fig. 3.
AIFITSSA model structure diagram.
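The token-statistics idea above can be sketched in a few lines (our own toy reconstruction, not the paper's implementation; the projection matrix, the gating function $m/(1+m)$, and all values are assumptions): tokens are reweighted by second-moment statistics of their projections instead of pairwise similarities, so cost grows as $O(n\,d\,p)$, linear in the token count $n$.

```python
# Toy sketch of linear-complexity, statistics-driven attention (hypothetical).
def linear_stat_attention(tokens, U):
    """tokens: n x d nested lists; U: d x p projection (assumed orthonormal-ish)."""
    n, d = len(tokens), len(tokens[0])
    p = len(U[0])
    # Projected tokens (U^T z_i): n x p, computed in O(n*d*p).
    proj = [[sum(U[a][j] * z[a] for a in range(d)) for j in range(p)] for z in tokens]
    # Per-direction second moments m_j = (1/n) sum_i proj[i][j]^2.
    m = [sum(proj[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    # Saturating gate (our choice): down-weights low-energy statistical directions.
    w = [mj / (1.0 + mj) for mj in m]
    # Reweight in the projected space, back-project, and add a skip connection.
    out = []
    for i in range(n):
        upd = [sum(U[a][j] * w[j] * proj[i][j] for j in range(p)) for a in range(d)]
        out.append([tokens[i][a] + upd[a] for a in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n=3 tokens, d=2
U = [[1.0, 0.0], [0.0, 1.0]]                   # identity projection for the toy case
out = linear_stat_attention(tokens, U)
assert len(out) == 3 and len(out[0]) == 2
```

No $n \times n$ attention map is ever materialized, which is the source of the linear memory footprint.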
Multi-scale fusion fine-grained feature network encoder
Robust multi-scale fine-grained feature perception is critical for defect detection tasks involving high-resolution images with substantial scale variations and complex background interference. To enhance feature representation and fusion, we propose a Multi-scale Fusion Fine-grained Network (MFFN) embedded within the encoder architecture. First, the MFFN module incorporates the bidirectional feature pyramid network (BIFPN)44. Traditional feature pyramids often suffer from one-way information flow. BIFPN fundamentally improves this by introducing efficient bidirectional cross-scale connections. This allows high-level semantic features to be effectively enriched with low-level fine-grained details, and vice versa. More critically for defect detection, BIFPN employs a fast normalized fusion strategy using learnable weights. This weighting mechanism empowers the network to dynamically prioritize the most informative features from different scales. Consequently, the model learns to effectively amplify subtle defect characteristics while suppressing irrelevant background interference. We integrate this advanced fusion structure with heterogeneous convolutional kernels, creating a synergistic framework that further enhances feature interaction and diversity. Second, global channels are expanded from 128 to 192 (a 1.5× increase). This modification addresses capacity bottlenecks in handling fine-grained patterns by enhancing feature mapping. The MFFN module architecture is illustrated in (Fig. 4).
Fig. 4.
Module structure in MFFN (a) shows the MSDC module (b) shows the position offset strategy (c) shows the CEM module.
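The fast normalized fusion that BiFPN contributes to MFFN can be sketched as follows (our illustration of the mechanism; the feature values and weights are made up): each input is scaled by a learnable non-negative weight and normalized by the weight sum, so the network can dynamically prioritize scales.

```python
# Minimal sketch of BiFPN-style fast normalized fusion (hypothetical values).
EPS = 1e-4  # small constant for numerical stability

def fast_normalized_fusion(features, weights):
    """features: list of equally-shaped 1D feature lists; weights: one scalar each."""
    w = [max(0.0, x) for x in weights]  # ReLU keeps the learnable weights non-negative
    denom = sum(w) + EPS
    fused = []
    for idx in range(len(features[0])):
        fused.append(sum(w[k] * features[k][idx] for k in range(len(features))) / denom)
    return fused

shallow = [1.0, 2.0, 3.0]  # high-resolution, fine-grained details
deep = [0.5, 0.5, 0.5]     # low-resolution semantics (assumed already up-sampled)
fused = fast_normalized_fusion([shallow, deep], weights=[2.0, 1.0])
# Each output element is (2*shallow + 1*deep) / (3 + EPS).
```

Unlike softmax-based fusion, this variant avoids exponentials, which is why it is favored for speed.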
Location-offset guided heterogeneous kernel selection
Although traditional feature pyramid structures (e.g., FPN45) enhance semantic fusion, they exhibit feature semantic drift, insufficient receptive fields, and limited spatial interactions for small targets. To address these limitations, we propose Location-Offset Guided Heterogeneous Kernel Selection (LOHCS). This mechanism synergistically optimizes multi-scale modeling and spatial interactions through feature blending. For kernel selection, we design a Multi-scale Depth-separable Convolutional (MSDC) module46. This module employs parallel heterogeneous kernels integrating multi-scale convolutions (1 × 1, 3 × 3, 5 × 5, 7 × 7, 9 × 9). Input features are channel-wise partitioned for parallel processing. Cross-scale semantic interactions are then achieved via channel rearrangement, mixing, and shuffling operations.
Figure 4a depicts the MSDC module architecture with three key characteristics: (1) Channel splitting and parallel processing: Input features are partitioned along channel dimensions into sub-tensors. These undergo parallel processing via heterogeneous kernels for multi-receptive-field feature extraction. (2) Channel mixing: Feature mixing disrupts channel independence in deep convolutions, enhancing feature interactions. (3) Adaptive receptive fields: Kernel size variations dynamically adjust receptive fields, adapting to multi-scale defects and enhancing small-target recognition. During feature mixing, a lightweight Location Offset Feature Mixing strategy enables dynamic cross-channel information interaction. Specifically, the input feature map partitions into four sub-feature groups. Bidirectional cyclic shifts then operate along height and width dimensions. The computational formulation appears below:
$$X_1'=\mathrm{Shift}_H^{+}(X_1),\quad X_1\in\mathbb{R}^{B\times C/4\times H\times W} \tag{6}$$

$$X_2'=\mathrm{Shift}_H^{-}(X_2),\quad X_2\in\mathbb{R}^{B\times C/4\times H\times W} \tag{7}$$

$$X_3'=\mathrm{Shift}_W^{+}(X_3),\quad X_3\in\mathbb{R}^{B\times C/4\times H\times W} \tag{8}$$

$$X_4'=\mathrm{Shift}_W^{-}(X_4),\quad X_4\in\mathbb{R}^{B\times C/4\times H\times W} \tag{9}$$
In Eqs. (6)–(9), $B$ denotes the batch size, $C$ the number of channels, $H$ the height, and $W$ the width. The shifted sub-tensors are then concatenated along the channel dimension. The computational formulation follows:
$$X'=\mathrm{Concat}\left(X_1',X_2',X_3',X_4'\right) \tag{10}$$
In Eq. (10), $\mathrm{Shift}$ denotes a cyclic shift along the specified dimension, where $+$ and $-$ indicate forward and reverse directions respectively. This operation directs channels toward complementary spatial regions without adding parameters, effectively mitigating the fixed receptive fields of traditional convolution. Feature mixing enables low-cost redistribution, creating diverse spatial contexts for the heterogeneous kernels. This enhances discriminative power for multi-scale features, with the Location Offset Heterogeneous Kernel Selection (LOHCS) illustrated in (Fig. 4b). This synergistic optimization significantly enhances network robustness in fine-grained surface defect extraction.
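The location-offset mixing above amounts to a parameter-free reshuffle, sketched here on nested lists (the group ordering and the shift step of 1 are our assumptions): channels split into four groups, with bidirectional cyclic shifts along height and width.

```python
# Toy sketch of location-offset feature mixing (hypothetical shift step of 1).
def roll_rows(grid, step):
    """Cyclic shift of a 2D grid along the height dimension."""
    step %= len(grid)
    return grid[-step:] + grid[:-step]

def roll_cols(grid, step):
    """Cyclic shift of a 2D grid along the width dimension."""
    return [row[-step % len(row):] + row[:-step % len(row)] for row in grid]

def location_offset_mix(feature):
    """feature: C x H x W nested lists, C divisible by 4."""
    g = len(feature) // 4
    g1, g2 = feature[:g], feature[g:2 * g]
    g3, g4 = feature[2 * g:3 * g], feature[3 * g:]
    return ([roll_rows(ch, 1) for ch in g1] +    # height, forward
            [roll_rows(ch, -1) for ch in g2] +   # height, reverse
            [roll_cols(ch, 1) for ch in g3] +    # width, forward
            [roll_cols(ch, -1) for ch in g4])    # width, reverse

feat = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]  # 4 channels, 2x2 map
mixed = location_offset_mix(feat)
assert mixed[0] == [[3.0, 4.0], [1.0, 2.0]]  # group 1 shifted down by one row
```

Because every shift is a pure re-indexing, the operation adds no parameters and no FLOPs beyond memory movement, matching the "low-cost redistribution" claim.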
Channel cross-embedding mechanism
While mainstream Transformer-based encoders employ stacked layers for feature modeling, inconsistent semantic abstraction levels across layers induce conflicts during direct fusion. Shallow features preserve rich details but contain high-frequency noise, whereas deep features enable global abstraction yet suffer severe fine-grained information attenuation. To address these issues, we propose the Channel Cross-Embedding Mechanism (CEM)47. It achieves cross-layer semantic consistency by fusing multi-scale features through uniform channel embedding, significantly reducing computational complexity. The structure is illustrated in (Fig. 4c). CEM establishes a multi-scale embedding layer that converts encoder features $F_i$ into unified region tokens $T_i$. The computational formulation is defined as:
$$T_i=\mathrm{Embed}\big(\mathrm{Down}_{r_i}(F_i)\big) \tag{11}$$
where $r_i$ denotes the down-sampling rate and $\mathrm{Down}_{r_i}$ the corresponding down-sampling operation. Cross-scale concatenation produces $T=\mathrm{Concat}(T_1,T_2,T_3)$. This assigns the current layer feature $F_l$ as the query $Q$, and $T$ as the key $K$ and value $V$. The formulation is defined as:
$$Q=F_l W_Q,\quad K=T W_K,\quad V=T W_V \tag{12}$$
Cross-channel attention weights are then computed and integrated, as defined in the following formulation:
$$\mathrm{Attn}(Q,K,V)=V\,\mathrm{Softmax}\!\left(\frac{Q^{\top}K}{\sqrt{d}}\right) \tag{13}$$
Finally, residual connections and layer normalization enhance feature stability and expression consistency. The formulation is defined as:
$$F_l'=\mathrm{LayerNorm}\big(F_l+\mathrm{Attn}(Q,K,V)\big) \tag{14}$$
Within the encoder system, this mechanism preserves semantic diversity across feature levels while constructing a unified cross-resolution representation space. It shifts computation from spatial to channel dimensions, significantly reducing computational complexity. This approach enables efficient cross-channel modeling while mitigating three issues in traditional encoders: semantic inconsistency, inefficient global dependency construction, and feature noise propagation, enabling lightweight real-time object detection frameworks.
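The shift from spatial to channel attention can be illustrated with a small sketch (our simplified stand-in for CEM, with projections omitted and made-up values): with $C$ channels and $N$ spatial positions, the attention map is $C \times C$ instead of $N \times N$, so cost scales with $C^2 N$ rather than $N^2 C$, which is cheap when $N \gg C$.

```python
# Toy sketch of attention along the channel dimension (hypothetical, no
# learned projections): each channel attends to every other channel.
import math

def softmax(row):
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def channel_attention(feat):
    """feat: C x N (each channel flattened over spatial positions)."""
    c, n = len(feat), len(feat[0])
    scale = math.sqrt(n)
    # C x C similarity between channels -- never an N x N map.
    sim = [[sum(feat[i][t] * feat[j][t] for t in range(n)) / scale
            for j in range(c)] for i in range(c)]
    attn = [softmax(row) for row in sim]
    # Re-mix channels with the attention weights.
    return [[sum(attn[i][j] * feat[j][t] for j in range(c)) for t in range(n)]
            for i in range(c)]

feat = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]  # C=2 channels, N=4 positions
out = channel_attention(feat)
assert len(out) == 2 and len(out[0]) == 4
# Each output row is a convex combination of the input channels.
assert all(0.0 <= v <= 1.0 for row in out for v in row)
```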
Up-sampling strategy based on gating mechanism
Maintaining spatial resolution while preserving target edge clarity and semantic consistency remains a key challenge during feature fusion in high-precision defect detection. Traditional up-sampling methods (e.g., linear interpolation48 and nearest-neighbor interpolation49) often produce fuzzy details and exhibit poor context adaptation, hindering balanced feature restoration and robust modeling. To address this, we propose AtUP: a gating mechanism-based multi-branch dynamic up-sampling strategy. This approach enhances feature detail expression via structural attention allocation while explicitly suppressing background noise propagation. Figure 5 illustrates the overall architecture.
Fig. 5.
AtUP model structure diagram.
AtUP employs a three-branch parallel architecture to construct the up-sampling path: Main branch: Uses transposed convolution to extend spatial resolution while reconstructing local structures. Auxiliary branch: Applies linear interpolation to enhance edge continuity. Gated path: Extracts scene context semantics via average pooling to guide feature response weighting. Computational formulas for the gated path are provided below.
$$G=\mathrm{AvgPool}(X) \tag{15}$$
The pooled features $G$ undergo channel dimension compression via a lightweight convolution. HardSigmoid activation is then applied to generate a [0, 1]-valued gating mask, computed as:
$$M=\mathrm{HardSigmoid}\big(\mathrm{Conv}_{1\times 1}(G)\big) \tag{16}$$
This step enables lightweight dynamic weight assignment, generating quasi-binarized weights to increase inter-class separability. The gating mask $M$ is broadcast across spatial dimensions and multiplied element-wise with the up-sampled features, suppressing noisy regions while enhancing critical areas. This operation is computed as:
$$Y=\mathrm{Up}(X)\odot M \tag{17}$$

where $\mathrm{Up}(X)$ denotes the combined output of the main and auxiliary up-sampling branches.
The gating path mitigates detail blurring induced by transposed convolution through noise suppression and edge preservation via gated weighting. Additionally, it dynamically adapts to complex scenes by integrating gating signals with global context, while reducing redundant signals through selective weighting. The AtUP structure embeds lightweight attention into the up-sampling path via a three-branch weight regulation mechanism. This enhances feature detail modeling while preserving model deployability.
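A single-channel toy sketch of the gated path (our own illustration: nearest-neighbor up-sampling stands in for the transposed-convolution main branch, and the scalar gate weight is made up): global pooling, a lightweight projection, a HardSigmoid gate, then element-wise reweighting.

```python
# Hypothetical sketch of AtUP-style gated up-sampling on a 1-channel map.
def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation: clamp(x/6 + 0.5, 0, 1)."""
    return min(1.0, max(0.0, x / 6.0 + 0.5))

def upsample_nearest_2x(grid):
    """Double spatial resolution by repeating rows and columns."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def atup_like(grid, gate_weight=0.5):
    up = upsample_nearest_2x(grid)
    # Gated path: global average pooling + lightweight projection + HardSigmoid.
    g = sum(sum(row) for row in grid) / (len(grid) * len(grid[0]))
    mask = hard_sigmoid(gate_weight * g)
    # Broadcast the scalar mask and reweight the up-sampled features.
    return [[mask * v for v in row] for row in up]

feat = [[1.0, 2.0], [3.0, 4.0]]
out = atup_like(feat)
assert len(out) == 4 and len(out[0]) == 4   # spatial resolution doubled
```

In the full module the mask is a per-position map rather than a scalar, but the gating arithmetic is the same.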
Experimental
Data set configuration
To comprehensively evaluate the generalization capability, robustness, and industrial applicability of the proposed FF-MDE metal defect detection model, we utilize two challenging industrial vision benchmarks: NEU-DET and GC10-DET. The NEU-DET dataset from Northeastern University contains six typical hot-rolled steel strip surface defects (Fig. 6). This dataset comprises 1,800 grayscale images (200 × 200 pixels) exhibiting complex backgrounds, blurred defect boundaries, low contrast, small targets, and camouflage effects. We perform pixel-level statistical analysis on all defect regions, with aspect ratio distribution correlations plotted in (Fig. 8a). The dataset is partitioned into training/validation/test sets (8:1:1 ratio).
Fig. 6.

NEU-DET dataset with (a) inclusion (b) patches (c) pitted surface (d) crazing (e) rolled-in-scale (f) scratches.
Fig. 8.
Pixel-level statistics of correlation between length and width, (a) for NEU-DET dataset (b) for GC10-DET dataset.
Collected from an operational steel plate production line, GC10-DET simulates industrial macroscopic inspection tasks. It covers ten defect types arising from various processing techniques (Fig. 7). Comprising 2294 high-resolution images (2048 × 1000 pixels), this dataset exhibits large-scale size variations, diverse defect morphologies, and extreme illumination changes. These challenges make it ideal for evaluating practical applicability and localization precision in complex industrial environments. To quantify small-defect handling capability, we perform pixel-level statistical analysis on defect regions. Correlation results are shown in (Fig. 8b). Using an 8:1:1 split ratio, the dataset is partitioned into training (1,836 images), validation (229 images), and test sets (229 images).
Fig. 7.
GC10-DET dataset, (a) for punching hole (b) for welding line (c) for crescent gap (d) for water spot (e) for oil spot (f) for silk spot (i) for inclusion (j) for rolled pit (k) for crease (l) for waist folding.
Experimental environment and parameter configuration
To ensure training stability and result reproducibility, all experiments were conducted in a standardized environment. OS: Windows 11; DL framework: PyTorch 2.1.0; Language: Python 3.10.15; CUDA/cuDNN: 11.8/8.7.0; CPU: AMD Ryzen 9 7945HX (64-bit); GPU: NVIDIA GeForce RTX 4060 Laptop. No pretrained weights were loaded during training to ensure fair comparison of structural modules. We employ the AdamW optimizer with hyperparameters specified in (Table 1).
Table 1.
Table of training parameter settings.
| Parameter name | Parameter value |
|---|---|
| Learning rate | 0.0001 |
| Epochs | 100 |
| Batch size | 4 |
| Workers | 4 |
| Weight decay | 0.0001 |
| Warmup epochs | 2000 |
| Warmup momentum | 0.8 |
| Momentum | 0.937 |
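The optimizer settings in Table 1 might be expressed in PyTorch roughly as below (a config fragment, not the authors' code; mapping the table's momentum entry onto AdamW's beta1 and handling warmup via a separate scheduler are our assumptions about the training framework).

```python
# Illustrative optimizer configuration mirroring Table 1 (assumptions noted above).
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder module standing in for FF-MDE

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,                 # learning rate
    betas=(0.937, 0.999),    # beta1 taken from the table's momentum entry (assumed)
    weight_decay=1e-4,       # weight decay
)
```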
Evaluation indicators
To evaluate the proposed model’s performance, this study employs precision, recall, average precision (AP), and mean average precision (mAP). Precision quantifies the proportion of true positives among predicted positive samples and is defined as:
$$P=\frac{TP}{TP+FP} \tag{18}$$
Recall measures the proportion of actual positive samples correctly identified. It is defined as:
$$R=\frac{TP}{TP+FN} \tag{19}$$
Average precision (AP) integrates precision over recall for a single category, and mAP averages AP across all categories:

$$AP=\int_{0}^{1}P(R)\,dR \tag{20}$$

$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_i \tag{21}$$
In Eqs. (20) and (21): $N$ is the number of defect categories (6 for NEU-DET; 10 for GC10-DET); $AP_i$ is the average precision for category $i$; mAP50 is the mean average precision at IoU = 0.5; mAP75 at IoU = 0.75; and mAP50:95 is averaged over IoU thresholds from 0.5 to 0.95. Object size-specific metrics: $AP_S$ for objects < 32² pixels; $AP_M$ for 32² ≤ area ≤ 96² pixels; $AP_L$ for objects > 96² pixels. Additionally, computational efficiency metrics are evaluated50: FPS, frames processed per second; GFLOPs, giga floating-point operations; and model size (MB), the parameter storage requirement.
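A worked sketch of Eqs. (18)–(21) on made-up counts (toy numbers, not results from the paper): precision, recall, a step-wise discretization of the AP integral, and the mAP average over categories.

```python
# Toy computation of the evaluation metrics (hypothetical counts and PR points).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def average_precision(recalls, precisions):
    """Step-wise area under the precision-recall curve (recalls ascending)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Toy detector on one class: 8 true positives, 2 false positives, 2 false negatives.
p = precision(8, 2)   # 0.8
r = recall(8, 2)      # 0.8
# Two-point PR curves and a two-class mAP.
ap_class1 = average_precision([0.5, 1.0], [1.0, 0.5])   # 0.5*1.0 + 0.5*0.5 = 0.75
ap_class2 = average_precision([0.5, 1.0], [0.8, 0.4])   # 0.5*0.8 + 0.5*0.4 = 0.6
map50 = (ap_class1 + ap_class2) / 2                     # 0.675
assert abs(map50 - 0.675) < 1e-9
```

The full mAP50:95 metric repeats this computation at ten IoU thresholds and averages the results.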
Ablation experiments and module performance analysis
To assess individual module contributions and their interactions in the FF-MDE framework, we conduct ablation studies on both the NEU-DET and GC10-DET datasets. Tables 2 and 3 quantitatively demonstrate module-specific enhancements in detection accuracy, computational efficiency, and model compactness.
Table 2.
Comparison of ablation experiment results for NEU-DET dataset.
| Test No | CEM | LOHCS | BBDES | AIFITSSA | AtUP | P(%) | R(%) | mAP50(%) | mAP50:95(%) | Params(M) | FLOPs(G) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  |  |  | 64.6 | 59.5 | 66.2 | 40.9 | 19.9 | 57.0 | 82.6 |
| 2 | ✓ |  |  |  |  | 66.5 | 62.1 | 66.7 | 41.1 | 29.5 | 57.2 | 81.4 |
| 3 |  | ✓ |  |  |  | 65.4 | 61.3 | 67.1 | 40.6 | 18.8 | 54.7 | 59.2 |
| 4 | ✓ | ✓ |  |  |  | 66.0 | 62.4 | 67.9 | 40.7 | 28.7 | 53.8 | 56.6 |
| 5 |  |  | ✓ |  |  | 63.0 | 62.7 | 66.7 | 41.7 | 19.9 | 57.0 | 86.1 |
| 6 |  |  |  | ✓ |  | 67.4 | 63.6 | 68.2 | 41.2 | 19.7 | 57.1 | 82.5 |
| 7 |  |  |  |  | ✓ | 62.3 | 64.1 | 68.3 | 39.6 | 20.5 | 60.6 | 82.4 |
| 8 | ✓ | ✓ | ✓ |  |  | 67.7 | 65.1 | 68.1 | 42.2 | 28.7 | 53.8 | 78.6 |
| 9 | ✓ | ✓ |  | ✓ |  | 71.6 | 63.2 | 68.6 | 41.7 | 28.5 | 53.9 | 61.3 |
| 10 | ✓ | ✓ |  |  | ✓ | 66.8 | 63.3 | 69.0 | 42.6 | 28.9 | 55.3 | 52.1 |
| 11 | ✓ | ✓ |  | ✓ | ✓ | 68.9 | 64.7 | 69.2 | 42.2 | 28.8 | 55.4 | 60.1 |
| 12 | ✓ | ✓ | ✓ | ✓ | ✓ | 70.2 | 65.3 | 70.3 | 42.5 | 28.8 | 55.4 | 65.6 |

CEM and LOHCS are the two sub-modules of MFFN.
Table 3.
Comparison of ablation experiment results for the GC10-DET dataset.
| Test No | CEM | LOHCS | BBDES | AIFITSSA | AtUP | P(%) | R(%) | mAP50(%) | mAP50:95(%) | Params(M) | FLOPs(G) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  |  |  | 67.9 | 50.1 | 54.6 | 27.7 | 19.8 | 57.0 | 85.3 |
| 2 | ✓ |  |  |  |  | 70.8 | 49.8 | 54.7 | 26.1 | 29.5 | 57.3 | 82.4 |
| 3 |  | ✓ |  |  |  | 71.5 | 48.5 | 54.8 | 26.9 | 18.8 | 54.8 | 61.2 |
| 4 | ✓ | ✓ |  |  |  | 76.3 | 52.3 | 55.7 | 27.0 | 28.7 | 53.8 | 56.6 |
| 5 |  |  | ✓ |  |  | 73.1 | 47.8 | 54.9 | 27.1 | 19.9 | 57.0 | 81.4 |
| 6 |  |  |  | ✓ |  | 68.9 | 54.2 | 55.1 | 27.8 | 19.7 | 57.1 | 88.8 |
| 7 |  |  |  |  | ✓ | 71.4 | 53.4 | 56.3 | 28.2 | 20.5 | 60.7 | 82.0 |
| 8 | ✓ | ✓ | ✓ |  |  | 76.1 | 51.2 | 56.0 | 27.6 | 28.7 | 53.8 | 63.8 |
| 9 | ✓ | ✓ |  | ✓ |  | 68.5 | 51.6 | 55.3 | 26.8 | 28.6 | 53.9 | 61.3 |
| 10 | ✓ | ✓ |  |  | ✓ | 69.9 | 52.7 | 56.8 | 27.6 | 29.0 | 55.3 | 65.2 |
| 11 | ✓ | ✓ |  | ✓ | ✓ | 74.5 | 53.9 | 57.1 | 27.4 | 28.8 | 55.4 | 63.2 |
| 12 | ✓ | ✓ | ✓ | ✓ | ✓ | 74.9 | 54.7 | 57.9 | 28.0 | 28.8 | 55.4 | 61.1 |
Experiments show that the CEM module, which integrates cross-layer feature projection and channel attention fusion, improves mAP50 by 0.5% on NEU-DET and 0.1% on GC10-DET. This validates the module’s fine-grained perception of samples with blurred boundaries or complex textures: it effectively constructs semantically consistent multi-scale representations. The LOHCS module, which combines spatial perturbation with kernel-scale diversity, markedly enhances small defect detection, achieving an average mAP50 gain of 0.6% while also reducing FLOPs by 2.3G. However, this design introduces a trade-off: LOHCS causes a noticeable drop in FPS, suggesting that its spatial perturbation operations introduce latency unfavorable for GPU parallel processing, a critical trade-off between detection accuracy and real-time performance. When co-embedded, the combined CEM + LOHCS configuration demonstrates strong structural complementarity, yielding mAP50 improvements of 1.7% on NEU-DET and 1.1% on GC10-DET. This validates the effectiveness of cross-scale reconstruction and spatial dynamic modulation in mitigating semantic drift and local under-response.
Deployed in the shallow encoding path, the BBDES module significantly enhances the detection of direction-sensitive defects such as scratches; its activation response is visualized in Fig. 9. The module improves mAP50 by 0.5% on NEU-DET and 0.3% on GC10-DET. Its multi-branch re-parameterization structure expands feature dimensions during training without adding inference-time parameters, thereby enhancing discrimination in low-contrast scenarios, although it slightly reduces inference speed on GC10-DET (from 85.3 to 81.4 FPS). The AIFITSSA module efficiently models long-range dependencies while controlling complexity, achieving a 1.3% average mAP50 gain with negligible changes in parameters, FLOPs, and FPS; this combination demonstrates exceptional efficiency and practical value. The AtUP module introduces channel-level attention gating to enhance semantic region selectivity, achieving a 1.9% average mAP50 gain. Although it slightly increases parameters (+0.6 M) and computational cost (+3.6 G FLOPs), it remains an efficient performance-enhancing component. The final FF-MDE framework integrates all of these modules. Despite a 21–28% reduction in inference speed relative to the baseline, the framework achieves mAP50 scores of 70.3 and 57.9% on the NEU-DET and GC10-DET datasets, respectively. These scores represent significant improvements of 4.1 and 3.3% over the baseline model, validating the overall effectiveness of the proposed approach.
Fig. 9.
Comparison of receptive-field visualizations of the backbone network before and after introducing the module.
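The parameter-neutral behaviour that re-parameterization affords (as in BBDES) can be illustrated with a toy 1-D example, purely our own sketch of the general idea rather than the module itself: because convolution is linear, parallel branch kernels that are summed at the output can be folded offline into a single equivalent kernel.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D convolution (cross-correlation) over plain lists."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def merge_branches(kernels):
    """Fold parallel branch kernels into one equivalent kernel.
    Smaller kernels (e.g. 1x1-style branches) are zero-padded so they
    align on the kernel centre before summation."""
    size = max(len(k) for k in kernels)
    merged = [0.0] * size
    for k in kernels:
        pad = (size - len(k)) // 2
        for j, w in enumerate(k):
            merged[pad + j] += w
    return merged
```

For two same-size branches, running both branches and summing their outputs gives exactly the same result as a single pass with the merged kernel, which is why the deployed (re-parameterized) network pays no extra parameter or FLOP cost for the training-time branches.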
Figure 10 compares the mAP50 and loss curves of the FF-MDE and baseline models during training. FF-MDE consistently achieves a higher mAP50 than the baseline, and its curve stabilizes after 40–60 epochs, demonstrating superior convergence. Both models’ loss curves decline rapidly and stably, supporting the architectural soundness and training stability of the FF-MDE framework.
Fig. 10.
Plot of the training process: (a) mAP50 curve; (b) loss curve.
Figure 11 presents a radar chart comparing five key metrics for the baseline model RT-DETR, the FF-MDE framework, and its core components. The FF-MDE framework achieves the largest coverage area, indicating optimal overall performance across all five evaluated dimensions: Precision (P), Recall (R), mAP50, mAP75, and mAP50:95. This result confirms the effective complementarity among submodules for feature fusion, representation enhancement, and spatial modeling. Together, they establish a multi-scale defect detection system with strong generalization capabilities.
Fig. 11.
Radar plots of each module on five performance metrics: (a) NEU-DET dataset; (b) GC10-DET dataset.
To validate the target localization, regional response, and feature-focusing capabilities of FF-MDE, Fig. 12 presents a qualitative analysis using Grad-CAM51 heatmaps. The baseline model exhibits scattered attention responses for defects like “Crazing” and “Patches.” It is also prone to erroneous activations in background textures and irrelevant regions. In contrast, the FF-MDE model demonstrates exceptional focusing capabilities. Whether dealing with complex backgrounds, low-contrast samples, or multiple targets, FF-MDE precisely highlights discriminative defect textures while effectively suppressing background noise. This enhanced interpretability and spatial selectivity visually confirm that FF-MDE’s module integration provides a structural advantage for improving fine-grained defect modeling.
Fig. 12.
Comparison of detection results in the ablation experiments with visualized heatmaps.
Comparative experiments and analysis
To validate FF-MDE’s performance and robustness, we benchmark against twelve representative models on the industrial inspection datasets NEU-DET and GC10-DET. These include two-stage detectors (Faster R-CNN52, Cascade R-CNN53, Dynamic Head54); one-stage detectors (YOLOv10-m55, GFL56, RetinaNet-R5057, RTMDet58, YOLO-X59, TOOD-R5060); and transformer-based detectors (RT-DETR-R18, DINO-4scale61, DEIM62). All models were trained and evaluated under identical hardware and training configurations to ensure fairness. Quantitative test results are presented in Table 4 (NEU-DET) and Table 5 (GC10-DET).
Table 4.
Comparison of the results of the comparison experiments of different models on the NEU-DET dataset.
| Model | APsmall(%) | APmedium(%) | APlarge(%) | mAP50(%) | mAP50:95(%) | FLOPs(G) | Params(M) | Size(MB) |
|---|---|---|---|---|---|---|---|---|
| Faster-RCNN | 32.5 | 34.4 | 49.3 | 59.1 | 36.8 | 134.7 | 41.4 | 315.6 |
| TOOD-R50 | 31.8 | 34.6 | 46.7 | 56.3 | 36.1 | 123.2 | 32.0 | 244.3 |
| Cascade-RCNN | 33.0 | 35.2 | 49.6 | 61.7 | 37.4 | 162.8 | 69.2 | 528.1 |
| Dynamic Head | 32.9 | 34.7 | 50.3 | 64.2 | 37.3 | 68.2 | 38.6 | 296.5 |
| DINO-4scale | 33.8 | 29.9 | 39.2 | 60.3 | 33.4 | 179.3 | 47.4 | 553.2 |
| GFL | 34.6 | 33.3 | 50.0 | 64.4 | 37.5 | 128.5 | 32.2 | 245.3 |
| RetinaNet-R50 | 32.7 | 30.7 | 46.7 | 56.3 | 35.1 | 129.7 | 36.4 | 277.9 |
| RTMDet | 32.1 | 34.6 | 48.9 | 64.5 | 36.7 | 8.1 | 4.9 | 80.2 |
| YOLO-X | 37.2 | 35.2 | 48.1 | 64.8 | 38.7 | 7.6 | 5.0 | 61.1 |
| YOLOv10-m | 37.5 | 38.6 | 53.0 | 69.8 | 41.2 | 64.0 | 16.5 | 16.8 |
| RT-DETR-R18 | 39.4 | 36.7 | 51.3 | 66.2 | 40.9 | 57.0 | 19.9 | 38.6 |
| DEIM | 40.4 | 35.8 | 56.7 | 65.3 | 42.7 | 24.9 | 10.2 | 157.1 |
| FF-MDE(Our) | 41.6 | 37.2 | 53.2 | 70.3 | 42.5 | 55.4 | 28.8 | 111.8 |
Table 5.
Comparison of the results of the comparison experiments of different models on the GC10-DET dataset.
| Model | APsmall(%) | APmedium(%) | APlarge(%) | mAP50(%) | mAP50:95(%) | FLOPs(G) | Params(M) | Size(MB) |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | 30.3 | 27.6 | 19.5 | 51.0 | 26.5 | 184.3 | 41.4 | 316.1 |
| TOOD-R50 | 27.8 | 21.7 | 13.2 | 48.2 | 22.0 | 174.2 | 32.0 | 244.2 |
| Cascade-RCNN | 25.3 | 31.2 | 26.6 | 53.7 | 27.4 | 212.3 | 69.3 | 529.6 |
| Dynamic Head | 26.1 | 30.9 | 25.3 | 54.0 | 27.2 | 67.9 | 38.5 | 297.4 |
| DINO-4scale | 24.2 | 16.4 | 19.8 | 47.2 | 20.2 | 241.6 | 47.6 | 556.1 |
| GFL | 23.5 | 22.8 | 14.8 | 45.7 | 21.0 | 181.2 | 32.3 | 246.3 |
| RetinaNet-R50 | 24.7 | 18.6 | 15.9 | 52.4 | 20.2 | 184.2 | 36.5 | 278.7 |
| RTMDet | 22.6 | 28.7 | 31.8 | 54.0 | 26.7 | 8.0 | 4.9 | 81.9 |
| YOLO-X | 23.7 | 24.6 | 22.6 | 45.4 | 23.5 | 7.6 | 5.0 | 62.0 |
| YOLOv10-m | 24.6 | 30.5 | 27.0 | 56.3 | 26.9 | 64.3 | 16.1 | 17.0 |
| RT-DETR-R18 | 23.6 | 30.1 | 32.4 | 54.6 | 27.7 | 57.0 | 19.9 | 76.9 |
| DEIM | 23.8 | 29.2 | 33.8 | 54.4 | 27.8 | 24.8 | 10.1 | 156.8 |
| FF-MDE(Our) | 25.6 | 31.0 | 30.3 | 57.9 | 28.0 | 55.4 | 28.8 | 111.2 |
On the NEU-DET dataset, FF-MDE demonstrates outstanding performance across all evaluation metrics. Notably, the model excels in small object detection, achieving an APsmall of 41.6%, which surpasses all compared methods. For overall accuracy, FF-MDE achieves a mAP50 of 70.3%, ranking first among all models and surpassing the strong YOLOv10-m (69.8%) and the baseline RT-DETR-R18 (66.2%). Under the stricter mAP50:95 metric, FF-MDE achieves a competitive 42.5%, nearly matching the transformer-based DEIM framework (42.7%). On the more challenging GC10-DET dataset, FF-MDE similarly maintains a leading position: it achieves 57.9% mAP50, ranking first among all models and significantly outperforming the baseline RT-DETR-R18 (54.6%). Furthermore, FF-MDE achieves the highest mAP50:95 score at 28.0%. The model also performs robustly on APmedium (31.0%) and APlarge (30.3%), confirming its feature modeling stability and generalization across multi-scale, high-resolution industrial scenarios.
Figure 13 visually compares key metrics. On both the NEU-DET and GC10-DET datasets, FF-MDE consistently leads in the metrics presented in the bar chart. This visually confirms its comprehensive advantages. Resource efficiency is paramount when evaluating practical deployment feasibility. FF-MDE achieves high accuracy while requiring only 28.8 M parameters, 55.4 GFLOPs, and a 111.8 MB model size. Compared to heavyweight models such as Cascade-RCNN (69.2 M parameters; 162.8 GFLOPs) and DINO-4scale (47.4 M parameters; 179.3 GFLOPs), FF-MDE achieves an optimal balance between accuracy and efficiency.
Fig. 13.
Bar charts of different models’ performance on key metrics: (a) NEU-DET dataset; (b) GC10-DET dataset.
Figure 14 presents a qualitative comparison of detection results, visually demonstrating the architectural advantages of FF-MDE. For challenging samples, including those with complex textures, low contrast, or minute targets, FF-MDE consistently exhibits superior localization accuracy and more focused regional responses. FF-MDE achieves state-of-the-art (SOTA) performance on both industrial defect datasets. Compared to the other advanced models listed in the tables, FF-MDE achieves average mAP50 improvements of 7.6% on NEU-DET and 6.5% on GC10-DET. These results confirm its strong competitive edge.
Fig. 14.
Schematic comparison of different models’ detection results with visualized heatmaps.
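The average-improvement figures can be reproduced directly from Tables 4 and 5. The snippet below (our own arithmetic check; variable names are illustrative) transcribes the mAP50 column of the twelve comparison models and subtracts the mean from FF-MDE's scores:

```python
# mAP50 of the twelve comparison models, transcribed from
# Table 4 (NEU-DET) and Table 5 (GC10-DET); FF-MDE rows excluded.
NEU_MAP50 = [59.1, 56.3, 61.7, 64.2, 60.3, 64.4,
             56.3, 64.5, 64.8, 69.8, 66.2, 65.3]
GC10_MAP50 = [51.0, 48.2, 53.7, 54.0, 47.2, 45.7,
              52.4, 54.0, 45.4, 56.3, 54.6, 54.4]

def avg_gain(ours: float, others: list) -> float:
    """FF-MDE's mAP50 minus the mean mAP50 of the comparison models."""
    return round(ours - sum(others) / len(others), 1)
```

With FF-MDE at 70.3% (NEU-DET) and 57.9% (GC10-DET), this yields average gains of 7.6 and 6.5 percentage points, consistent with the averages quoted in the abstract.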
Conclusion
This paper proposes the FF-MDE framework to address three core challenges in small-scale metal surface defect detection: insufficient localization accuracy, weak fine-grained feature responses, and stringent real-time constraints. Through its modular design, FF-MDE achieves an efficient balance between feature characterization quality, scale adaptability, and computational efficiency. On the widely used industrial inspection datasets NEU-DET and GC10-DET, FF-MDE achieves mAP50 scores of 70.3 and 57.9%, respectively, surpassing the average performance of mainstream models by 7.6 and 6.5%. The model achieves over 60 FPS inference speed on standard GPU hardware while maintaining fewer than 30 million parameters and model weights under 120 MB. This demonstrates FF-MDE’s ability to deliver high detection accuracy alongside strong deployment adaptability and resource efficiency.
Despite significant achievements, we recognize that efficiently deploying models to resource-constrained edge devices is a critical step toward realizing industrial applications. Therefore, we have designated this as a core research direction for the future. To achieve this goal, we will focus on exploring technical approaches such as mixed-precision training, channel pruning, and knowledge distillation for deep model compression and acceleration. This aims to transfer knowledge from high-precision models to smaller, faster models while minimizing accuracy loss. Additionally, other research directions requiring breakthroughs include constructing multi-source heterogeneous datasets to overcome data limitations and developing dual-channel adversarial feature decoupling techniques to enhance generalization capabilities. In summary, FF-MDE’s modular architecture and collaborative strategy provide a generalizable design paradigm for industrial intelligent inspection systems, delivering significant theoretical and engineering value for advancing the frontier of quality control in smart manufacturing.
Author contributions
Cheng Liu: Methodology, Writing—review & editing. Kai Chen: Software, Investigation, Writing—original draft. ZhengWei Lian: Validation, Data curation. TianXu Bai: Funding acquisition. Na Jia: Supervision, Project administration. FuJie Geng: Visualization, Investigation. NanChao Wang: Software, Validation.
Funding
This work was supported by the following grants: Central Government Guidance Fund for Local Science and Technology Development (ZY23CG18) and Jiangshan City Key Science and Technology Project (2023G04).
Data availability
The NEU-DET and GC10-DET datasets used in this study are publicly available on the Zenodo platform (access link: https://zenodo.org/records/16882077). The data supporting the findings of this study are also available from the corresponding author.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Gökler, M. İ. & Ozanözgü, A. M. Experimental investigation of effects of cutting parameters on surface roughness in the WEDM process. Int. J. Mach. Tools Manuf. 40 (13), 1831–1848 (2000).
- 2. Qureshi, T. et al. Graphene-based anti-corrosive coating on steel for reinforced concrete infrastructure applications: challenges and potential. Constr. Build. Mater. 351, 128947 (2022).
- 3. Feng, W., Arouche, M. M. & Pavlovic, M. Influence of surface roughness on the mode II fracture toughness and fatigue resistance of bonded composite-to-steel joints. Constr. Build. Mater. 411, 134358 (2024).
- 4. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 (7553), 436–444 (2015).
- 5. Tang, Y. et al. Deep learning-based phase demodulation for distributed acoustic sensor. Sci. Rep. 15 (1), 29767 (2025).
- 6. Liu, W. et al. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I (pp. 21–37) (2016).
- 7. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587) (2014).
- 8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448) (2015).
- 9. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969) (2017).
- 10. Liu, W., Wang, C. & Zhang, Y. Industrial surface defect detection by multi-scale Inpainting-GAN. Visual Comput. 41 (8), 5643–5660 (2025).
- 11. Li, Z., Gao, Z., Yi, H., Fu, Y. & Chen, B. Image deblurring with image blurring. IEEE Trans. Image Process. 32, 5595–5609 (2023).
- 12. Kingma, D. P. & Welling, M. An introduction to variational autoencoders. Found. Trends Mach. Learn. 12 (4), 307–392 (2019).
- 13. Li, H., Jiang, X., Guan, B., Wang, R. & Thalmann, N. M. Multistage spatio-temporal networks for robust sketch recognition. IEEE Trans. Image Process. 31, 2683–2694 (2022).
- 14. Liu, C., Chen, K., Wang, N., Shi, W. & Jia, N. A lightweight multi-scale feature fusion method for detecting defects in water-based wood paint surfaces. Measurement 253, 117505 (2025).
- 15. Fang, J. et al. GPS Galileo BDS3 LEO uncalibrated phase delays estimation and tight combination precise point positioning with ambiguity resolution. Sci. Rep. 15 (1), 21676 (2025).
- 16. Chen, Y., Liu, H. & Liang, J. Fabric defect detection via explicit de-background. Eng. Appl. Artif. Intell. 159, 111708 (2025).
- 17. Sun, F. et al. A cascaded and aggregated transformer network for RGB-D salient object detection. IEEE Trans. Multimedia 26, 2249–2262 (2023).
- 18. Hu, X., Zhang, X., Wang, F., Sun, J. & Sun, F. Efficient camouflaged object detection network based on global localization perception and local guidance refinement. IEEE Trans. Circuits Syst. Video Technol. 34 (7), 5452–5465 (2024).
- 19. Xiao, H. et al. WFF-Net: trainable weight feature fusion convolutional neural networks for surface defect detection. Adv. Eng. Inform. 64, 103073 (2025).
- 20. Hu, X., Sun, F., Sun, J., Wang, F. & Li, H. Cross-modal fusion and progressive decoding network for RGB-D salient object detection. Int. J. Comput. Vision 132 (8), 3067–3085 (2024).
- 21. Huang, J., Hong, C., Xie, R., Ran, L. & Qian, J. A simple and efficient channel MLP on token for human pose estimation. Int. J. Mach. Learn. Cybernet. 16 (5), 3809–3817 (2025).
- 22. Wang, J. et al. Depth-assisted semi-supervised RGB-D rail surface defect inspection. IEEE Trans. Intell. Transp. Syst. 25 (7), 8042–8052 (2024).
- 23. Lu, M., Sheng, W., Zou, Y., Chen, Y. & Chen, Z. WSS-YOLO: an improved industrial defect detection network for steel surface defects. Measurement 236, 115060 (2024).
- 24. Xie, W., Ma, W. & Sun, X. An efficient re-parameterization feature pyramid network on YOLOv8 to the detection of steel surface defect. Neurocomputing 614, 128775 (2025).
- 25. Liang, F. et al. HDFA-Net: A high-dimensional decoupled frequency attention network for steel surface defect detection. Measurement 242, 116255 (2025).
- 26. Luo, Q. et al. Camouflaged defect detection network for steel surface. IEEE Trans. Instrum. Meas. 73, 1–13 (2023).
- 27. Zhang, Y., Wang, H., Shen, W., Peng, G. & Du, A. K. Reinforcement learning-based knowledge graph reasoning for steel surface defect detection. IEEE Trans. Autom. Sci. Eng. (2023).
- 28. Hu, X. et al. Steel surface defect detection based on self-supervised contrastive representation learning with matching metric. Appl. Soft Comput. 145, 110578 (2023).
- 29. Zhang, W., Yang, J., Bo, X. & Yang, Z. A dual attention mechanism network with self-attention and frequency channel attention for intelligent diagnosis of multiple rolling bearing fault types. Meas. Sci. Technol. 35 (3), 036112 (2023).
- 30. Rui, C., Wu, Z., Liu, C., Li, B. & Cheng, J. Surface defect detection of steel strip at low resolution based on SAC-YOLOv5. Meas. Sci. Technol. 36 (1), 016234 (2024).
- 31. Chen, G. et al. ESDDNet: efficient small defect detection network of workpiece surface. Meas. Sci. Technol. 33 (10), 105007 (2022).
- 32. Tu, Q., Zong, Y., Li, Z. & Yue, L. Application of hybrid CNN-transformer for classifying major coal mine accident hazards. Sci. Rep. 15 (1), 32183 (2025).
- 33. Lee, X., Hong, C., Zhang, X. & Chen, Y. DroFormer: Temporal action detection with drop mechanism of attention. Int. J. Mach. Learn. Cybernet. 16 (11), 9413–9428 (2025).
- 34. Li, D., Jiang, C. & Liang, T. REDef-DETR: real-time and efficient DETR for industrial surface defect detection. Meas. Sci. Technol. 35 (10), 105411 (2024).
- 35. Mao, H. & Gong, Y. Steel surface defect detection based on the lightweight improved RT-DETR algorithm. J. Real-Time Image Proc. 22 (1), 28 (2025).
- 36. Lin, X. et al. EAPT: efficient attention pyramid transformer for image processing. IEEE Trans. Multimedia 25, 50–61 (2021).
- 37. Hong, C., Chen, L., Liang, Y. & Zeng, Z. Stacked capsule graph autoencoders for geometry-aware 3D head pose estimation. Comput. Vis. Image Underst. 208, 103224 (2021).
- 38. Xie, Y., Hong, C., Zhuang, W., Liu, L. & Li, J. HOGFormer: high-order graph convolution transformer for 3D human pose estimation. Int. J. Mach. Learn. Cybernet. 16 (1), 599–610 (2025).
- 39. Wu, D., Wu, R., Wang, H., Cheng, Z. & To, S. Real-time detection of blade surface defects based on the improved RT-DETR. J. Intell. Manuf., 1–13 (2025).
- 40. Qin, Y., Chi, X., Sheng, B. & Lau, R. W. GuideRender: large-scale scene navigation based on multi-modal view frustum movement prediction. Visual Comput. 39 (8), 3597–3607 (2023).
- 41. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778) (2016).
- 42. Wan, D. et al. YOLO-MIF: improved YOLOv8 with multi-information fusion for object detection in gray-scale images. Adv. Eng. Inform. 62, 102709 (2024).
- 43. Wu, Z. et al. Token statistics transformer: Linear-time attention via variational rate reduction. arXiv (2024).
- 44. Tan, M., Pang, R. & Le, Q. V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10781–10790) (2020).
- 45. Lin, T. Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2117–2125) (2017).
- 46. Rahman, M. M., Munir, M. & Marculescu, R. EMCAD: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11769–11779) (2024).
- 47. Wang, H., Cao, P., Wang, J. & Zaiane, O. R. UCTransNet: rethinking the skip connections in U-Net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 3, pp. 2441–2449) (2022).
- 48. Kirkland, E. J. Bilinear interpolation. In Advanced Computing in Electron Microscopy, 261–263 (2010).
- 49. Koester, E. & Sahin, C. S. A comparison of super-resolution and nearest neighbors interpolation applied to object detection on satellite data. arXiv (2019).
- 50. Muksimova, S., Valikhujaev, Y., Umirzakova, S., Baltayev, J. & Cho, Y. I. GazeCapsNet: A lightweight gaze estimation framework. Sensors 25 (4), 1224 (2025).
- 51. Selvaraju, R. R. et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (pp. 618–626) (2017).
- 52. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), 1137–1149 (2016).
- 53. Cai, Z. & Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43 (5), 1483–1498 (2019).
- 54. Dai, X. et al. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7373–7382) (2021).
- 55. Wang, A. et al. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 37, 107984–108011 (2024).
- 56. Li, X. et al. Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 33, 21002–21012 (2020).
- 57. Lin, T. Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988) (2017).
- 58. Lyu, C. et al. RTMDet: an empirical study of designing real-time object detectors. arXiv (2022).
- 59. Ge, Z., Liu, S., Wang, F., Li, Z. & Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv (2021).
- 60. Feng, C., Zhong, Y., Gao, Y., Scott, M. R. & Huang, W. TOOD: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 3490–3499) (IEEE Computer Society, 2021).
- 61. Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9650–9660) (2021).
- 62. Huang, S. et al. DETR with improved matching for fast convergence. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 15162–15171) (2025).