Scientific Reports. 2026 Feb 5;16:7334. doi: 10.1038/s41598-026-37780-9

Multi-scale defect detection technology for wind turbine blade surfaces based on the SASED-YOLO algorithm

Feiyang Lv 1,3, Rugang Wang 1,3, Yuanyuan Wang 1,3, Feng Zhou 1,3, Xuesheng Bian 1,3, Naihong Guo 2
PMCID: PMC12923570  PMID: 41644670

Abstract

In wind turbine blade defect detection, the blurred textures and large scale differences of defects make fine-grained feature extraction difficult and localization inaccurate. To address these challenges, this paper proposes a wind turbine blade defect detection algorithm (SASED-YOLO) that integrates a collaborative attention mechanism and multi-scale feature space pooling. First, a collaborative attention mechanism (CADP-SCSA) is designed and incorporated into the feature extraction network to minimize interference from complex backgrounds, effectively enhancing the extraction of multi-scale features within the global context and improving localization accuracy. Second, a multi-scale feature space pooling module (SPPSCCAP) is designed to enhance the processing and fusion of fine-grained, multi-scale defect features on wind turbine blades. The C2f-SENetv2 module is employed to enhance the representation of features across different channels. Finally, an adaptive slice convolution module (FADown) is designed to effectively reduce information loss during downsampling. Experimental analysis is performed on the self-constructed wind turbine blade dataset (WTBD818-DET). The proposed algorithm achieves a mean average precision (mAP) of 87.7%, which is 10.5% higher than YOLOv8s, and outperforms mainstream detection algorithms such as RT-DETR, YOLOv11s, and Mamba. The experimental results indicate that the algorithm maintains high performance in detecting multi-scale wind turbine blade defects.

Keywords: Wind turbine blade, Defect detection, Attention mechanism, Multi-scale feature space pooling, YOLOv8s

Subject terms: Electrical and electronic engineering, Wind energy

Introduction

Looking ahead, a global consensus has emerged to enhance green and low-carbon development mechanisms and expedite the planning and construction of a new energy system. The wind power industry therefore demonstrates promising development prospects. With the continuous innovation of technology and the improvement of the industrial chain, floating offshore wind power is steadily moving from the experimental stage to commercial application. As one of the most critical components of wind turbines, the reliable operation of wind turbine blades is paramount to ensuring the stable performance of wind turbines1,2. Most wind farms are located in regions with harsh environments and strong winds. Due to prolonged exposure to these extreme conditions, wind turbine blades are susceptible to lightning strikes, cracks, acid rain erosion, breakage, oil corrosion, and other forms of damage, all of which affect power generation efficiency and service life. Therefore, the safety monitoring of wind turbine blades is a critical task. Currently, commonly used detection methods for defects in wind turbine blades include ultrasonic testing, infrared thermography, and acoustic emission monitoring. For these methods, detection accuracy decreases significantly when the surface of the wind turbine blade is uneven or coated. To improve detection accuracy, researchers have proposed wind turbine blade defect detection technologies based on deep learning, which have made significant advancements in this field3–18.

Based on existing research, the YOLO object detection algorithm does not require the generation of a large number of candidate regions, thereby reducing detection time and improving the real-time performance of target detection. It offers distinct advantages in engineering applications and has made significant research progress. For example, in 2022, Xiaoxun Z and other researchers used the C3TR module to replace the C3 module in YOLOv5s, improving the model’s ability to recognize crack targets with lighter colors or lower resolutions19. Although the model performed well in detecting cracks in wind turbine blades, the detection categories were limited to crack defects. In 2022, Ran X and other researchers proposed the AFB-YOLO algorithm to address the limitations of YOLOv5s in feature fusion and feature extraction. By incorporating a bidirectional feature pyramid network (BiFPN) and a coordinate attention mechanism (CA), the detection performance for small target defects was significantly improved20. However, detection accuracy remained insufficient. In 2023, Zhang Y and colleagues proposed the YOLOv5s_MCB algorithm. By integrating MobileNetv3, a spatial and channel attention mechanism (CBAM), and BiFPN into the YOLOv5s base network, the detection accuracy and speed for tiny defect targets were enhanced21. However, complex backgrounds in the images may still affect the model’s detection performance. In 2023, Yao Y and other researchers optimized the backbone and feature fusion networks of YOLOX to improve the model’s detection capability for small target defects22. However, the improved model employs the RepVGG network for feature extraction, which may limit its ability to extract multi-scale defect features. In 2024, Tong L and other researchers proposed the WTBD-YOLOv8 algorithm to address the issues of slow inference speed and reduced detection accuracy caused by the complex backgrounds of wind turbine blades.
The algorithm replaced some C2f modules in the neck network with MHSA-C2f modules, which can effectively capture global feature information under complex backgrounds; the Mini-BiFPN network structure they designed also improves the model’s inference speed23. Based on existing research, it is evident that the robustness and detection performance of wind turbine blade defect detection algorithms can be effectively improved by using feature fusion methods based on the YOLO detection framework. Currently, however, the defect categories in most wind turbine blade datasets lack sufficient diversity, which reduces detection accuracy when models are applied in real-world environments. Therefore, the accuracy of multi-category and multi-scale defect detection for wind turbine blades still needs to be further improved.

To address the difficulty of extracting fine-grained features and the inaccurate localization caused by blurred textures and large scale differences of wind turbine blade defects, this paper proposes a wind turbine blade defect detection algorithm (SASED-YOLO) that integrates a collaborative attention mechanism and multi-scale feature space pooling. First, a collaborative attention mechanism module (CADP-SCSA) is designed and incorporated into the backbone network. The CSAM structure within this module employs multi-scale depthwise shared one-dimensional convolution to extract fine-grained information from different channels. Additionally, the DPSA module utilizes a local-global channel information fusion method to capture global context and multi-scale feature information. The DPSA structure enhances the model’s ability to interpret features from different channels, alleviating semantic inconsistencies between channels and reducing interference from complex backgrounds, thereby improving the robustness of the algorithm. Second, a multi-scale feature space pooling module (SPPSCCAP) is introduced, integrating the ScConv module42 and a global receptive field branch. This combination effectively expands the receptive field and strengthens the algorithm’s ability to process and fuse multi-scale wind turbine blade defect information. The SENetv2 block25 is then combined with the C2f module to form the C2f-SENetv2 module, which is integrated into the feature extraction network to enhance the backbone network’s ability to capture features from different channels. Finally, an adaptive slice convolution module (FADown) is designed to capture complete shallow feature information, reduce information loss during downsampling, and further improve the detection efficiency and localization accuracy of wind turbine blade defects.

Related works

Basic model of the YOLOv8 algorithm

The YOLO (You Only Look Once) algorithm series represents a family of state-of-the-art real-time object detection algorithms. Designed for efficient edge device deployment, YOLO simultaneously predicts multiple bounding boxes through its single-forward-pass architecture, enabling both rapid and precise detection. These characteristics have attracted substantial research interest in object detection applications. Through continuous iteration, YOLOv8 incorporates substantial architectural improvements including enhanced feature extraction and optimization strategies. The architecture provides model variants scaled from YOLOv8n (nano) to YOLOv8x (extra-large), with intermediate versions YOLOv8s (small), YOLOv8m (medium), and YOLOv8l (large). These variants differ in terms of parameter count, computational complexity, and channel dimensions, thereby enabling deployment optimization across diverse application scenarios. Larger-scale models, such as YOLOv8x, feature an increased number of parameters, necessitating greater GPU memory allocation, higher computational resources, and longer inference times. Consequently, these models are suboptimal for real-time detection systems26–29. To balance accuracy and speed, this paper selects the YOLOv8s model as the base and optimizes it to enhance the detection accuracy for wind turbine blade defects. Figure 1 illustrates the architectural overview of YOLOv8s. The architecture consists of three main components: the backbone network, the neck network, and the detection head. The backbone network is composed of CBS modules (convolution, batch normalization, and SiLU activation), the C2f module with enhanced gradient flow, and the SPPF module, forming an efficient and lightweight CSP-based convolutional architecture.
The neck network adopts an improved PAN-FPN structure, which performs bidirectional feature fusion and scale alignment across different feature levels, thereby enhancing the representation capability of the extracted target features. The detection head adopts a decoupled head structure that segregates the classification and regression tasks. It incorporates Distribution Focal Loss (DFL) to assign differential weights to each predicted bounding box category, thereby enhancing the discrimination of complex targets. A task-aligned sample matching method (Task-Aligned Assigner) optimizes sample allocation by ensuring that classification and regression tasks are well-aligned in target detection, thereby augmenting the model’s target representation capability, generalization, and accuracy.

Fig. 1. Structure of the YOLOv8 network model.

SASED-YOLO network model

Objects in natural scenes typically exhibit clear and distinguishable features, which enables YOLOv8s to demonstrate strong detection performance in such target detection tasks. However, wind turbine blade defect targets are characterized by limited texture information and complex, variable features. When using the backbone network of YOLOv8s to extract features from blade defect images, both global and local detail information may be lost, which in turn affects the detection accuracy and real-time performance. To address this issue, this paper proposes the SASED-YOLO detection algorithm, which integrates a collaborative attention mechanism and multi-scale feature space pooling. The overall structure of the algorithm is illustrated in Fig. 2. In the 1st, 3rd, 6th, and 9th layers of the backbone network, the input features are progressively downsampled through multiple Fully Adaptive Downsampling (FADown) modules. These modules not only reduce the spatial resolution of the feature map but also increase the number of channels, thereby enhancing the network’s ability to capture global features. At layers 4, 7, and 10, the feature extraction module (C2f_SENetV2) is employed to recalibrate the channel weights of shallow feature maps, thereby enabling fine-grained semantic information to be efficiently propagated across different scales and fully utilized during feature fusion and upsampling in the neck network (layers 13, 14, 17, and 23). At key layers such as the 5th, 8th, and 11th layers, the collaborative attention mechanism (CADP-SCSA) captures the dependencies between spatial and channel dimensions, accurately assigning attention weights to multi-scale targets within different feature maps. This mechanism effectively integrates feature information in the neck network. 
Finally, at layer 12, a multi-scale feature space pooling module (SPPSCCAP) is introduced to integrate contextual information from different receptive fields, thereby enabling the subsequent neck network to more accurately understand the relationship between defect targets and their surrounding environment, and enhancing the detection performance of multi-scale wind turbine blade defects.

Fig. 2. Structure of the SASED-YOLO network model.

Adaptive slice convolution

In the feature extraction network, standard convolution operations, when employed for downsampling the feature map, often lead to the loss of fine-grained information and an increase in computational complexity. To mitigate these issues, this study introduces an adaptive slice convolution module (FADown), the structure of which is illustrated in Fig. 3. Unlike standard convolution, the FADown module integrates both slicing and pooling mechanisms, allowing contextual information to be retained effectively during downsampling. Specifically, the input feature map X first undergoes an average pooling operation with stride 1, producing the smoothed feature map X_avg. Although this operation does not change the spatial resolution, it achieves local smoothing in the numerical domain, thereby suppressing high-frequency noise to some extent and stabilizing the input feature distribution. This provides a more reliable foundation for subsequent multi-path feature processing. Afterward, the feature map X_avg is split along the channel dimension into two parts, X_1 and X_2. This process can be expressed as:

X_1, X_2 = Split(X_avg)    (1)

One branch of the module applies the slicing method to process the feature map X_1, segmenting it into four distinct regions to produce four sub-feature maps. Tiny edge information is then extracted from each of these sub-feature maps. Along the channel dimension, the sub-feature maps are merged to form the resulting output, efficiently condensing the spatial information into the channel dimension. This process aggregates detailed information from various spatial positions, thereby improving the model’s capacity to capture and represent local details. The process can be expressed as:

p_tl = X_1[:, 0::2, 0::2],  p_tr = X_1[:, 0::2, 1::2],  p_bl = X_1[:, 1::2, 0::2],  p_br = X_1[:, 1::2, 1::2]    (2)

F_1 = Concat(p_tl, p_tr, p_bl, p_br)    (3)

where p_tl, p_tr, p_bl, and p_br represent the four different sub-regions extracted from the spatial dimensions of the input tensor.

Fig. 3. Structure of the FADown module.

The other branch employs a stride-2 max pooling operation to process the feature map X_2. Within each local region of the feature map, the maximum value is selected, thereby reducing the spatial dimensions to half of their original size. Subsequently, a convolutional layer is applied to further process the feature map. This approach reduces computational complexity, and the process can be summarized as follows:

F_2 = Conv(MaxPool(X_2))    (4)

In general, the FADown module adopts a dual-branch structure in which slicing and max pooling are performed separately, facilitating the combination of key features and information interaction. As a result, the model’s ability to capture multi-scale defect features in wind turbine blades is improved.
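The slicing branch is in essence a space-to-depth rearrangement. A minimal NumPy sketch is given below (illustrative only; the paper's module additionally applies learned convolutions and the parallel pooling branch):

```python
import numpy as np

def slice_branch(x):
    """Space-to-depth slicing: rearrange a (C, H, W) feature map into
    four spatial sub-maps and stack them along the channel axis,
    halving H and W while quadrupling C. No values are discarded."""
    ptl = x[:, 0::2, 0::2]  # top-left pixel of each 2x2 block
    ptr = x[:, 0::2, 1::2]  # top-right
    pbl = x[:, 1::2, 0::2]  # bottom-left
    pbr = x[:, 1::2, 1::2]  # bottom-right
    return np.concatenate([ptl, ptr, pbl, pbr], axis=0)

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
y = slice_branch(x)
print(y.shape)  # (8, 2, 2): channels x4, spatial dims halved
```

Because every input value survives the rearrangement, the branch condenses spatial detail into channels without the information loss of a strided convolution.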

C2f-SENetv2 module

In wind turbine blade defect detection, if the feature extraction module fails to fully utilize information across different channels, it may lead to the loss of multi-scale and critical features, thereby reducing the capability to detect complex defects. To address this issue, the C2f_SENetv2 module is introduced, and its structure is shown in Fig. 4. This module incorporates a channel-adaptive mechanism that dynamically adjusts the weights of each channel through a process of compression followed by excitation, thereby enabling the full extraction of detailed features from critical regions.

Fig. 4. Structure of the C2f-SENetv2 module.

During the compression process, the global average pooling operation reduces the spatial dimensions of the input feature map and transforms the spatial information of each channel into a global statistic. Specifically, the input feature map is denoted as X ∈ R^{C×H×W}, where C is the number of channels, and H and W represent the height and width of the feature map, respectively. The global average pooling operation compresses the input feature map into a tensor of shape C×1×1, where each element corresponds to the global average of the respective channel. This process can be expressed as:

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)    (5)

where H is the height of the feature map, W is the width of the feature map, and z_c is the result of global average pooling for channel c.

During the excitation process, four parallel fully connected branches are used to process the results of global average pooling. Each branch processes global information to varying extents through parallel computations, aiming to improve the network’s capacity to represent features from diverse perspectives. Specifically, each fully connected branch reduces the number of input feature channels to C/R, where R is a predefined reduction ratio, thus reducing computational cost while prioritizing more critical features. Subsequently, the output of each branch undergoes a nonlinear transformation through an activation function, which introduces more complex feature relationships. The output features from the four branches are then merged along the channel dimension to form an informative feature tensor, denoted as F. To enable the feature tensor F to be weighted along the channel dimension, it is mapped back to the original number of channels C using a fully connected layer. Finally, the resulting feature map is reshaped into the form C×1×1, ensuring that the transformed feature weights s and the input feature tensor X can be multiplied element-wise. In this manner, the features of each channel are scaled according to their respective weighting coefficients, extracting key details from the target regions. This process can be mathematically expressed as follows:

s = σ(W · F + b),  Y = s ⊗ X    (6)

where σ is the activation function, and W and b are the weights and bias of the fully connected layer, respectively.
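The squeeze-excitation-reweight pipeline described above can be sketched in NumPy as follows; the weight matrices here are random stand-ins for the learned fully connected layers, and the four-branch layout follows the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, R = 8, 4, 4, 4
x = rng.standard_normal((C, H, W)).astype(np.float32)

# Squeeze: global average pooling -> one statistic per channel (Eq. 5)
z = x.mean(axis=(1, 2))                       # shape (C,)

# Excitation: four parallel FC branches each reduce C -> C/R; the
# concatenated result F is mapped back to C channels (Eq. 6)
branches = [np.maximum(rng.standard_normal((C // R, C)) @ z, 0.0)
            for _ in range(4)]                # ReLU(W_i z), each (C/R,)
f = np.concatenate(branches)                  # (4*C/R,) = (C,) here
s = 1.0 / (1.0 + np.exp(-(rng.standard_normal((C, C)) @ f)))  # sigmoid

# Reweight: scale each channel of x by its attention coefficient
y = x * s[:, None, None]
print(y.shape)  # (8, 4, 4)
```

The per-channel coefficients s lie in (0, 1), so informative channels are preserved while weak ones are attenuated rather than discarded.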

Collaborative attention mechanism

In wind turbine blade defect detection, the defect types exhibit significant scale variation, and the defects often present blurred textures and complex background interference, which substantially increase the difficulty of accurate localization. Existing backbone networks still have limitations in suppressing background noise and focusing on critical defect regions, making them prone to generating redundant bounding boxes and causing inaccurate localization. To address these issues, this paper proposes a collaborative attention mechanism module (CADP-SCSA), which enhances the network’s ability to respond to key regions of multi-scale and low-pixel defects, thereby improving localization accuracy. The structure of the module is illustrated in Fig. 5.

Fig. 5. Structure of the CADP-SCSA module.

Specifically, on the input feature map X, in order to identify the critical regions in the image, mean pooling operations are performed along the width and height dimensions of the feature map, respectively, to obtain X_h and X_w, which represent the global statistical information of the feature map in the two dimensions. Subsequently, the channels of the feature maps X_h and X_w are divided into four equal sub-groups to facilitate the learning of spatial features and contextual information at different locations within the feature map. This process can be represented as:

X_h^i = X_h[(i − 1)G : iG],  X_w^i = X_w[(i − 1)G : iG],  i = 1, …, 4,  G = C / 4    (7)

where G is the number of channels in each group, and X_h^i and X_w^i represent the i-th sub-tensors obtained by partitioning X_h and X_w along the channel dimension, respectively.

Next, depthwise convolutions with kernel sizes of 3, 5, 7, and 9 are applied to the independent sub-features, extracting spatial structural information at multiple scales. This process can be expressed as:

X'_h^i = DWConv_{k_i}(X_h^i),  X'_w^i = DWConv_{k_i}(X_w^i),  k_i ∈ {3, 5, 7, 9}    (8)

After obtaining the new sub-feature maps, the feature maps X'_h^i and X'_w^i are each concatenated along the channel dimension, thereby preserving contextual information from multiple sources at the same spatial location. Finally, the concatenated feature maps are group-normalized, and a Sigmoid activation function is applied to activate specific spatial regions. This process can be expressed as:

A_h = σ(GN(Concat(X'_h^1, …, X'_h^4))),  A_w = σ(GN(Concat(X'_w^1, …, X'_w^4)))    (9)

where σ is the Sigmoid activation function, and GN is the group normalization function.

Unlike the original SCSA attention mechanism, the proposed CADP-SCSA introduces an additional channel-attention branch to perform channel-wise weighting on the input feature map, thereby enhancing the representation of critical features. Specifically, the input feature map X is processed by both max pooling and average pooling operations. Max pooling focuses on regions with high activation responses and captures local salient details, whereas average pooling extracts global contextual information from the image. To ensure rich feature information for computing the attention weights, the captured local and global features are summed and fused using a fully connected layer. Subsequently, a Sigmoid activation function is applied to map the fused result to the range [0, 1], generating refined attention weights. These weights are then applied to the input feature map X, enabling the model to adaptively prioritize specific spatial locations and channels, thus obtaining the feature tensor S. Finally, the feature tensor S is concatenated with the output of the spatial attention branch along the channel dimension to produce the feature map P. The concatenated feature map P contains richer feature information, further enhancing the model’s ability to identify important regions. This process can be expressed as follows:

S = σ(FC(MaxPool(X) + AvgPool(X))) ⊗ X,  P = Concat(S, S_spa)    (10)

where S_spa denotes the output of the spatial attention branch.

In the DPSA module, although depthwise convolutions possess a certain capability in processing the features of each channel, they still exhibit limitations in capturing fine-grained information across channels, which may lead to imprecise target localization. Compared with the original SCSA module, we introduce an enhanced depthwise separable convolution module (DMSConv1d) to mitigate this issue. The DMSConv1d module consists of three key components: depthwise convolution, pointwise convolution, and a residual connection. In the depthwise convolution, the kernel is applied independently to each input channel, enabling efficient capture of spatial features while reducing computational complexity and the number of parameters. The pointwise convolution focuses on extracting channel-wise features, and the residual connection between the depthwise and pointwise convolutions enhances gradient flow during training. This improves gradient propagation, reduces training difficulties, prevents gradient explosion in deep networks, and ultimately boosts the detection accuracy of the model.

In summary, by incorporating a channel attention branch and the improved depthwise separable convolution module (DMSConv1d), the CADP-SCSA module significantly outperforms the original SCSA module in accurately localizing multi-scale, low-pixel defect targets.
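The added channel-attention branch can be sketched as follows (illustrative NumPy; a single random matrix stands in for the learned fully connected layer):

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 8, 6, 6
x = rng.standard_normal((C, H, W)).astype(np.float32)

# Per-channel descriptors: local salient detail and global context
max_desc = x.max(axis=(1, 2))   # max pooling: high-activation regions
avg_desc = x.mean(axis=(1, 2))  # average pooling: global context

# Sum-fuse the two descriptors, pass them through a (random,
# illustrative) fully connected layer, then squash to (0, 1)
fc = rng.standard_normal((C, C)).astype(np.float32)
weights = 1.0 / (1.0 + np.exp(-(fc @ (max_desc + avg_desc))))

# Apply the refined attention weights to the input channels
s = x * weights[:, None, None]
print(s.shape)  # (8, 6, 6)
```

In the full module, the weighted tensor S is then concatenated with the spatial-attention output along the channel dimension, as in Eq. (10).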

Multi-scale feature space pooling module

Wind turbine blades often feature large-scale defective targets, which contain rich high-level semantic information. However, fixed receptive fields limit the model’s ability to effectively capture these large-scale defects and hinder the fusion of local information across multi-scale defects. To address this issue, this study designs a multi-scale feature space pooling module (SPPSCCAP), as illustrated in Fig. 6, to enhance the model’s ability to process fine-grained features of multi-scale wind turbine blade defects and to improve feature fusion.

Fig. 6. Structure of the SPPSCCAP module.

In order to obtain defect information from different receptive fields, the multi-scale feature space pooling module obtains multi-level contextual information from different scales and receptive fields through three independent parallel branches. In the global receptive field branch, the number of channels in the feature map X is first compressed using a 1×1 convolution, which helps the model focus on key features and prevents excessive loss of local detail during the subsequent average pooling. An adaptive average pooling operation then extracts the global contextual information from the feature map and reduces the computational burden by reducing the spatial dimensions to a small fixed size. Finally, during the up-sampling stage, the spatial dimensions of the feature map are restored using bilinear interpolation to ensure that the output feature map of this branch matches the spatial dimensions of the feature maps from the other branches.

In the max pooling branch, the feature map is processed top-down through three successive max pooling layers to gradually extract local details at different scales, while maintaining the ability to fuse multi-scale features. In order to reduce feature redundancy, the ScConv layer is introduced to deeply analyze and select the input features. ScConv consists of a spatial reconstruction unit (SRU) and a channel reconstruction unit (CRU). The spatial reconstruction unit adopts a separation-reconstruction method, as shown in Fig. 7. First, the scaling factor in group normalization is utilized to evaluate the effective information present in the feature maps: the variance of the pixels in each batch and channel is evaluated, and the learnable scaling factor γ is used to tune the response of the feature maps. During the separation process, a gate threshold of 0.5 is set to separate more informative feature maps from less informative ones, aligning with the spatial content. The reconstruction operation aims to reduce spatial feature redundancy by combining more informative features with less informative ones. Specifically, when a feature weight exceeds the threshold, the corresponding region has high discriminative power in the semantic representation, forming the more informative weight W_1; feature weights below the threshold form the less informative weight W_2. Subsequently, the input feature tensor X is multiplied with W_1 and W_2 separately, generating the high-information feature X_1^w and the low-information feature X_2^w. Next, to further reduce feature redundancy, a feature reconstruction strategy is applied, which adds the high-information feature and the low-information feature to generate a feature M with richer information, thereby reducing spatial feature redundancy.
This process can be expressed as:

W = Gate(Sigmoid(W_γ ⊗ GN(X))),  X_1^w = W_1 ⊗ X,  X_2^w = W_2 ⊗ X,  M = X_1^w ⊕ X_2^w    (11)

where Gate is the gating function, Sigmoid is the activation function, and GN is the group normalization function.
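The separation-reconstruction step can be sketched in NumPy as follows (the group-normalization scale γ is simulated with random positive values, and the gate keeps the soft weights rather than binarizing them, following the description above):

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 6, 4, 4
x = rng.standard_normal((C, H, W)).astype(np.float32)
gamma = np.abs(rng.standard_normal(C)).astype(np.float32)  # stand-in GN scale

# Group-normalize per channel, weight by the normalized gamma, squash
gn = (x - x.mean(axis=(1, 2), keepdims=True)) / (x.std(axis=(1, 2), keepdims=True) + 1e-5)
w_gamma = gamma / gamma.sum()
w = 1.0 / (1.0 + np.exp(-(w_gamma[:, None, None] * gn)))  # weights in (0, 1)

# Gate at the 0.5 threshold: informative vs. less-informative weights
w1 = np.where(w > 0.5, w, 0.0)   # more informative weight W_1
w2 = np.where(w <= 0.5, w, 0.0)  # less informative weight W_2

x_hi = w1 * x                    # high-information feature
x_lo = w2 * x                    # low-information feature

# Reconstruction: add the two streams to reduce spatial redundancy
m = x_hi + x_lo
print(m.shape)  # (6, 4, 4)
```

Each spatial position falls into exactly one of the two weight maps, so the reconstruction recombines complementary information rather than duplicating it.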

Fig. 7. Structure of the SRU.

The channel reconstruction unit (CRU) uses a segmentation-transformation-fusion approach to reduce channel redundancy, as shown in Fig. 8. The input features are divided into two parts during the segmentation process: one part with αC channels and the other with (1 − α)C channels, where 0 ≤ α ≤ 1. The two parts are each compressed with a 1×1 convolution, and the two outputs obtained are denoted as X_up and X_low. Subsequently, the two compressed features are subjected to grouped convolution and pointwise convolution, respectively, and the process can be expressed as:

Y_1 = GWC(X_up),  Y_2 = PConv(X_low)    (12)

where GWC denotes grouped convolution, and PConv denotes pointwise convolution.

Fig. 8. Structure of the CRU.

During the fusion process, the two sets of outputs are first processed using adaptive global average pooling, producing the pooled vectors S_1 and S_2. Subsequently, the softmax function normalizes S_1 and S_2, and the resulting feature weights, β_1 and β_2, are multiplied by the corresponding convolution outputs. Finally, the two products are summed to generate the refined feature Y. The process can be expressed as:

β_1, β_2 = Softmax(S_1, S_2),  Y = β_1 ⊗ Y_1 + β_2 ⊗ Y_2    (13)

Finally, the feature vectors from each branch are concatenated along the channel dimension to enable efficient fusion of features from different scales, thereby obtaining global and local feature representations with redundant information removed. In the task of wind turbine blade defect detection, this method significantly enhances the capability for multi-scale target feature extraction and fusion.
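The three-branch structure can be illustrated with a minimal NumPy sketch; nearest-neighbor repetition stands in for bilinear upsampling, and the 1×1 convolutions and ScConv refinement are omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W = 4, 8, 8
x = rng.standard_normal((C, H, W)).astype(np.float32)

# Branch 1: global receptive field -- adaptive average pooling to a
# small 2x2 grid, then upsampling back to HxW
g = x.reshape(C, 2, H // 2, 2, W // 2).mean(axis=(2, 4))   # (C, 2, 2)
g_up = g.repeat(H // 2, axis=1).repeat(W // 2, axis=2)     # (C, H, W)

# Branch 2: local max pooling over 2x2 blocks, broadcast back to
# full resolution (stands in for the cascaded max-pooling stack)
m = x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))    # (C, 4, 4)
m_up = m.repeat(2, axis=1).repeat(2, axis=2)               # (C, H, W)

# Branch 3: identity (the original features)
out = np.concatenate([g_up, m_up, x], axis=0)              # channel concat
print(out.shape)  # (12, 8, 8)
```

Because every branch is resized to the same spatial grid before concatenation, the fused tensor carries global context, mid-level local maxima, and fine detail side by side along the channel dimension.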

Experiments and results analysis

Experimental dataset

In our experiments, we employed a self-collected wind turbine blade defect detection dataset and a publicly available weld defect detection dataset to validate the feasibility and generalization capability of the proposed algorithm.

Due to the lack of a publicly available wind turbine blade defect dataset, a self-constructed wind turbine blade dataset (WTBD818-DET) was used in this study. The images in this dataset were collected from the Roboflow official website and accurately labeled with the categories and locations of blade defects using the LabelImg tool. The dataset covers eight common blade surface defects: surface_eye, injury, crack, surface_oil, craze, Lightning-Strike, surface_attach, and surface_corrosion. The labeled categories are shown in Fig. 9. A total of 7,374 images were used in the experiment, allocated into training, validation, and test sets following an 8:1:1 ratio, which included 5,899 training images, 747 validation images, and 738 test images.
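An 8:1:1 split of this kind can be reproduced as follows (a generic sketch; the exact per-subset counts depend on how rounding remainders are assigned):

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and partition a list of image paths into train/val/test
    subsets according to the given ratios; leftover items from
    rounding are assigned to the training set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_val, n_test = int(n * ratios[1]), int(n * ratios[2])
    n_train = n - n_val - n_test
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(7374))
print(len(train), len(val), len(test))  # 5900 737 737
```

Fixing the shuffle seed makes the partition reproducible across runs, which matters when comparing models trained on the same split.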

Fig. 9. Eight typical samples from the WTBD818-DET dataset.

The weld defect detection dataset is a publicly available dataset collected from the Kaggle website, used for detecting surface defects in welds. It contains three types of targets: bad weld, good weld, and defect, as shown in Fig. 10, with an image resolution of 640×640. The dataset is divided into three subsets, consisting of 1,619 images for training, 283 images for validation, and 126 images for testing.

Fig. 10. Three typical samples from the Weld Defect dataset.

Statistics and analysis

In the WTBD818-DET dataset, the feature patterns and sizes of different types of defects vary significantly, which increases the complexity of model training to some extent. To address the diverse and complex characteristics of blade defects in this dataset, this study conducts a systematic statistical analysis of the label counts for each defect category and proposes targeted optimization strategies based on the analysis results to enhance the model’s object detection and classification performance. It is important to note that the term “number of samples” refers to the number of images, whereas the “number of labels” denotes the total number of defect instances; therefore, the total number of labels is considerably larger than the number of images.

As shown in Fig. 11, the number of defect labels in the “injury” category is relatively large, reaching 11,477 instances, whereas the “surface_eye” and “craze” categories contain only 217 and 36 labels, respectively. The features of the “craze” defects are relatively distinct; therefore, despite the small number of samples, the 36 labels are still sufficient to support effective training. Although the “injury” and “crack” categories contain 11,477 and 1,161 defect labels, respectively, their defect textures tend to be blurred, and the defect scales vary greatly. Consequently, it is necessary to enlarge the model’s receptive field to enhance its ability to extract and fuse multi-scale target features. Additionally, these damage targets contain dense, small-scale defects that are highly susceptible to interference from irrelevant background information, resulting in misdetections. To address this issue, introducing an attention mechanism can effectively focus on the target area while suppressing the influence of background noise. Meanwhile, defects such as “surface_corrosion”, “Lightning-Strike”, and “surface_eye” share certain similarities in their characteristics. Consequently, a conventional neural network requires improvement to achieve the fine-grained recognition ability needed to distinguish these wind turbine blade defects.

Fig. 11.


Distribution of the number of defects.
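Per-category label counts of the kind plotted in Fig. 11 can be gathered by scanning YOLO-format annotation files, where each line encodes one defect instance as `<class_id> <x> <y> <w> <h>`. A minimal sketch (the function name is hypothetical, and it operates on in-memory file contents for brevity):

```python
from collections import Counter

# Class order is assumed; the actual dataset's class indexing may differ.
CLASS_NAMES = ["surface_eye", "injury", "crack", "surface_oil",
               "craze", "Lightning-Strike", "surface_attach", "surface_corrosion"]

def count_labels(label_files):
    """Count defect instances per class across a list of label-file contents."""
    counts = Counter()
    for text in label_files:
        for line in text.splitlines():
            if line.strip():
                class_id = int(line.split()[0])   # first field is the class index
                counts[CLASS_NAMES[class_id]] += 1
    return counts

files = ["1 0.5 0.5 0.2 0.1\n1 0.3 0.4 0.1 0.1", "4 0.6 0.2 0.3 0.2"]
counts = count_labels(files)
```

Because one image may contain many instances, counting lines rather than files is what makes the label totals far exceed the image totals, as noted above.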

To overcome the aforementioned challenges, this paper improves the model’s ability to concentrate on critical regions within the image, thereby helping the neural network to more effectively distinguish between defect classes with similar features. As a result, the improved model is not only capable of capturing large-scale macroscopic features but also adept at accurately identifying subtle local features, thus preventing misdetection or omission due to variations in defect scales.

Experimental platform and related indicators

Tables 1 and 2 present the configuration of the experimental platform and the training parameters used for each framework. To ensure fairness in the comparative experiments among different models, all baseline models and ablation models were trained using consistent parameter settings, including a fixed input resolution of 640×640, an IoU threshold of 0.7, disabled NMS, and training from scratch without any pretrained weights. In addition, to maintain comparability in the object localization task, all models employ CIoU as the bounding box regression loss function.

Table 1.

Experimental platforms.

Configuration Model
CPU Intel(R) Core(TM) i5-12400F
GPU NVIDIA GeForce RTX 4060Ti
System Windows 10
Programming language Python 3.10.13
Frame Pytorch 2.1

Table 2.

Experimental details of the framework.

Hyperparameter Value
Batch size 32
Fixed image size 640×640
Learning rate 0.01
Decay rate 0.0005
Momentum 0.937
Optimizer SGD
Workers 8
NMS False
Epochs 1200 (WTBD818-DET), 600 (Weld Defect)
IoU 0.7
Patience 100
Amp True

It is important to note that within the same dataset, all models are trained with the same number of epochs to avoid deviations caused by differences in training duration. However, the two datasets used in this study exhibit distinct convergence behaviors and training stability; therefore, the number of training epochs is not unified across datasets. On the WTBD818-DET dataset, the model continues to improve over the course of 1200 epochs without signs of overfitting, and thus the training duration is set to 1200 epochs. In contrast, on the Weld Defect dataset, the model reaches stable convergence at around 600 epochs, and further training leads to performance degradation (with patience set to 100, meaning early stopping is triggered if no improvement occurs within 100 epochs). Therefore, the number of training epochs for this dataset is set to 600 to achieve better generalization performance.
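The patience-based early stopping described above can be sketched as a small helper (a simplified stand-in for the training framework's built-in mechanism, with patience shortened to 3 for illustration):

```python
class EarlyStopping:
    """Stop training when the monitored metric has not improved for `patience` epochs."""
    def __init__(self, patience=100):
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best:
            self.best, self.stale = metric, 0   # improvement resets the counter
        else:
            self.stale += 1
        return self.stale >= self.patience

stopper = EarlyStopping(patience=3)
history = [0.70, 0.72, 0.71, 0.71, 0.71]        # mAP plateaus after epoch 2
stops = [stopper.step(m) for m in history]      # stop fires on the final epoch
```

With patience set to 100, training halts only after 100 consecutive epochs without improvement, which is how the 600-epoch budget on the Weld Defect dataset was settled.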

Regarding data augmentation strategies, the training pipelines of the Mamba, YOLO-series, and SASED-YOLO models employ Mosaic augmentation to increase sample diversity and enhance detection performance for small objects. In contrast, the RT-DETR model follows its original Transformer-based training paradigm and does not adopt Mosaic augmentation.
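Mosaic augmentation stitches four training images into one composite so that each batch exposes the model to more objects and scales. A heavily simplified 2×2 stitch (pure-Python lists standing in for image arrays; the real augmentation also randomizes the joint point, rescales the tiles, and remaps the bounding boxes) looks like:

```python
def mosaic_4(tiles):
    """Naive 2x2 mosaic: stitch four equally sized 2D 'images' into one canvas."""
    tl, tr, bl, br = tiles
    top = [a + b for a, b in zip(tl, tr)]        # concatenate rows side by side
    bottom = [a + b for a, b in zip(bl, br)]
    return top + bottom                          # stack the two halves vertically

tiles = [[[i] * 2] * 2 for i in range(4)]        # four 2x2 tiles filled with 0..3
canvas = mosaic_4(tiles)                         # one 4x4 composite
```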

In this study, to better evaluate the performance of the model, we use several key metrics—including precision (P), recall (R), average precision (AP), and mean average precision (mAP)—to comprehensively assess the model’s effectiveness in detecting wind turbine blade defects and weld defects.

First, the mean average precision (mAP) is used to assess the overall capability of the model; a higher mAP value indicates better performance across all detection categories. The mAP is computed as the mean of the average precision (AP) over all class labels, where the AP of each class is the area under its precision-recall (P-R) curve. The expressions for P, R, AP, and mAP are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{14}$$

where TP (true positives) denotes the number of correctly detected wind blade defects; FN (false negatives) denotes the number of real wind blade defect targets missed in the detection; and FP (false positives) denotes the number of defects incorrectly identified as wind blade defects by the model.

$$AP = \int_0^1 P(R)\,dR \tag{15}$$
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{16}$$

where N is the number of defect categories.

The number of parameters (Params) in the model quantifies its spatial complexity, while the amount of computational work (GFLOPs) assesses the consumption of computational resources. FPS is used to evaluate the detection speed of the model. A higher FPS indicates that the model processes more images per second, thereby enhancing real-time performance. The formula for calculating FPS is as follows:

$$FPS = \frac{TF}{ET} \tag{17}$$

where TF represents the total number of frames processed by the model, and ET is the time taken to process all the images.
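The metrics above can be computed directly from the counts and curve samples; the sketch below (with trapezoidal integration standing in for the exact interpolation used by the evaluation toolkit) illustrates Eqs. (14)-(17) on toy numbers:

```python
def precision_recall(tp, fp, fn):
    """Eq. (14): P = TP/(TP+FP), R = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Eq. (15): AP as the area under a sampled P-R curve (trapezoidal rule here)."""
    return sum((recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
               for i in range(1, len(recalls)))

def mean_average_precision(aps):
    """Eq. (16): mAP is the mean of the per-class AP values."""
    return sum(aps) / len(aps)

def fps(total_frames, elapsed_seconds):
    """Eq. (17): FPS = TF / ET."""
    return total_frames / elapsed_seconds

p, r = precision_recall(tp=80, fp=20, fn=20)                # (0.8, 0.8)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])    # 0.8
```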

Ablation study of convolutional modules

In this comparative study, all models were trained using identical hyperparameters and data augmentation strategies on the same wind turbine blade defect detection dataset, and all inference tests were conducted on an RTX 4060 Ti GPU to ensure fairness in experimental evaluation. Furthermore, all methods were built upon the same backbone architecture with a fixed input resolution of 640×640, where only the 1st, 3rd, 6th, and 9th layers of the backbone were replaced with the convolutional modules under comparison. This design ensures strict control of variables, allowing performance differences to accurately reflect the intrinsic contribution of each module. Additionally, FPS and latency were computed without including post-processing time to maintain comparability of inference speed across models. Under these rigorous experimental settings, the proposed study reliably assesses the overall advantages of the FADown module in both detection accuracy and inference efficiency.

To comprehensively evaluate the performance of the proposed FADown module, this study includes several high-performance convolutional structures as comparison baselines, including SPD-Conv30, ODConv31, standard convolution with a stride of 2, DualConv32, RFAConv33, GhostConv34, and the MobileNetv3 backbone35. The comparison results are summarized in Table 3. Overall, the FADown module exhibits stable performance in terms of Precision and Recall, achieving an mAP@0.5 of 0.784, which surpasses all competing modules except MobileNetv3 and RFAConv. This demonstrates that FADown provides a clear advantage in detection accuracy. Although the detection accuracy of FADown is slightly lower than that of MobileNetv3 and RFAConv, it offers significant advantages in model complexity and inference efficiency. Specifically, its parameter count is only 9.99 M, substantially smaller than MobileNetv3 and RFAConv, while its FPS is markedly higher than both counterparts.

Table 3.

Ablation study of various convolutional modules.

Model P R mAP@0.5 mAP@0.5:0.95 Latency FPS Parameters GFLOPs
YOLOv8s+SPD-Conv 0.782 0.698 0.776 0.487 4.90ms 204.12 10.26M 26.4
YOLOv8s+Conv2d 0.790 0.738 0.772 0.486 5.01ms 199.48 11.13M 28.5
YOLOv8s+ODConv 0.868 0.652 0.775 0.499 6.27ms 159.47 15.86M 24.7
YOLOv8s+MobileNetv3 0.831 0.697 0.786 0.508 6.82ms 146.60 10.42M 21.9
YOLOv8s+DualConv 0.765 0.690 0.775 0.471 4.75ms 210.37 10.13M 26.0
YOLOv8s+RFAConv 0.759 0.707 0.785 0.489 8.05ms 124.18 11.18M 28.9
YOLOv8s+GhostConv 0.785 0.698 0.759 0.474 4.83ms 207.06 10.36M 26.6
YOLOv8s+FADown 0.767 0.768 0.784 0.503 4.88ms 204.82 9.99M 25.7

These results demonstrate that FADown is able to sustain strong detection performance while maintaining a low parameter count and high inference speed, thereby further validating its advantages in feature extraction capability and lightweight architectural design.

Comparison experiment on attention mechanisms

To verify the effectiveness of the proposed CADP-SCSA module, this section conducts comparative studies under identical experimental settings and datasets. Since the introduction of CADP-SCSA requires adjusting the repetition number of the C2f blocks in the neck network, the same network configuration is maintained for other attention-mechanism experiments to ensure fair comparison. Specifically, the repetition number of the last three C2f blocks in the neck is uniformly fixed at 1. Subsequently, several mainstream attention mechanisms are individually integrated into the YOLOv8s model and consistently embedded at the 5th, 8th, and 13th layers of the backbone. The experimental results for each standalone attention mechanism are summarized in Table 4.

Table 4.

Ablation experiments on different attention mechanisms.

Model P R mAP@0.5 mAP@0.5:0.95 Parameters GFLOPs
YOLOv8s+CBAM 0.822 0.642 0.767 0.477 11.47M 28.7
YOLOv8s+CA 0.793 0.74 0.787 0.493 11.15M 28.5
YOLOv8s+ECA 0.835 0.708 0.796 0.511 11.13M 28.5
YOLOv8s+TSSA 0.702 0.672 0.752 0.475 11.82M 29.7
YOLOv8s+EMA 0.849 0.726 0.797 0.510 11.13M 28.7
YOLOv8s+ELA 0.828 0.763 0.819 0.529 11.13M 28.5
YOLOv8s+SCSA 0.777 0.755 0.805 0.511 11.14M 28.5
YOLOv8s+CADP-SCSA 0.825 0.754 0.824 0.522 12.92M 29.2

We selected several attention mechanisms for comparison, including CBAM36, CA37, ECA38, TSSA39, EMA40, ELA41, and SCSA42. As shown in the table, the proposed CADP-SCSA module achieves the highest detection accuracy among all evaluated models, reaching an mAP@0.5 of 82.4% on the wind turbine blade defect detection dataset. This performance substantially surpasses that of the other attention mechanisms, thereby demonstrating the effectiveness and superiority of the CADP-SCSA module.

Building upon the performance improvements validated by the quantitative metrics in the previous section, this study further conducts a qualitative analysis to examine the effectiveness of the proposed CADP-SCSA module from the perspective of feature responses. To this end, we employ the LayerCAM43 technique, using the multi-level joint feature maps from the 5th, 8th, and 13th layers of the backbone as inputs to generate heatmaps during inference. This visualization enables a more intuitive assessment of how different attention mechanisms distribute their focus over surface defect regions.

As illustrated in Fig. 12, the heatmap visualization shows that the model incorporating the CADP-SCSA module exhibits more complete and continuous activation patterns within defect regions, while effectively suppressing irrelevant feature responses in complex background areas. In addition, the activations around critical defect areas are more concentrated, with the focused regions closely aligned with the true defect locations and exhibiting clear contour boundaries, thereby facilitating more accurate defect localization. Overall, these visualization results are consistent with the quantitative findings presented earlier, further confirming the effectiveness of the CADP-SCSA module in defect feature extraction and precise localization.

Fig. 12.


Heatmap visualization of various attention mechanisms.

Experimenting with improved strategies

To verify the rationale for placing the improved collaborative attention mechanism module at the 5th, 8th, and 13th layers of the backbone network, as well as to demonstrate its performance advantages over the original SCSA module at these positions, a series of comparative experiments were conducted. The experimental results are presented in Table 5.

Table 5.

Comparative experiment on the effect of different positional attention mechanisms.

Model P R mAP@0.5 mAP@0.5:0.95 Parameters GFLOPs
YOLOv8s+SCSA(A) 0.776 0.755 0.805 0.514 11.14M 28.5
YOLOv8s+SCSA(B) 0.790 0.729 0.790 0.506 11.14M 28.5
YOLOv8s+SCSA(C) 0.782 0.719 0.794 0.477 11.14M 28.5
YOLOv8s+CADP-SCSA(D) 0.825 0.754 0.824 0.522 12.91M 29.9

The experiments were conducted using the basic YOLOv8s network. As shown in Table 5, incorporating the SCSA attention mechanism42 module into different parts of the network yields varying performance effects. Specifically, Method (A) incorporates the SCSA module into the backbone network; Method (B), into the connection between the backbone and neck networks; Method (C), between the neck network and the detection head; and Method (D) adds the CADP-SCSA module to the backbone network. In Table 5, the precision (P) and recall (R) of Method (C) are lower than those of Method (B), although its mAP@0.5 is higher. The precision (P), recall (R), mAP@0.5, and mAP@0.5:0.95 of Method (A) outperform those of Methods (B) and (C). Therefore, integrating the attention mechanism into the backbone is the optimal choice. Additionally, compared to Method (A), the improved Method (D) demonstrates significant enhancements across all indicators, with precision (P), recall (R), mAP@0.5, and mAP@0.5:0.95 reaching 82.5%, 75.4%, 82.4%, and 52.2%, respectively. This performance improvement can be attributed to the CADP-SCSA attention mechanism module designed in this paper, which integrates both local and global channel attention mechanisms, thereby significantly enhancing the model’s ability to comprehend information from each channel.

In summary, for multi-scale defect images of wind turbine blades, this mechanism can accurately capture the salient features of the target areas while suppressing interference from complex backgrounds.

Ablation and comparative study of the SPPSCCAP module

To validate the effectiveness of the proposed SPPSCCAP module, we conducted comparative experiments against several commonly adopted enhancement modules in existing wind turbine blade defect detection studies. As shown in Table 6, when integrated into the baseline YOLOv8s model, the SPPSCCAP module achieves improvements of 3.6%, 2.4%, and 1.6% in mAP@0.5 compared with the BiFPN, MHSA-C2f, and SPPF modules, respectively. These performance gains primarily arise from the incorporation of multiple advanced feature extraction components within the SPPSCCAP module, including multi-scale pooling layers, ScConv layers, and adaptive pooling layers, which collectively expand the receptive field via cross-channel concatenation. This design not only enhances the model’s capability to extract multi-scale defect features from wind turbine blades but also improves feature fusion effectiveness. In addition, compared with the SimSPPF module, the SPPSCCAP module yields a further 1.6% improvement in mAP@0.5 on the YOLOv8 baseline. This advantage is largely attributed to the SPPSCCAP module’s use of upsampling, downsampling, and channel concatenation operations, enabling the extraction of richer feature details and significantly improving the model’s robustness and detection performance under complex background conditions.

Table 6.

SPPSCCAP module effect comparison experiment.

Model P R mAP@0.5 mAP@0.5:0.95 Parameters GFLOPs
YOLOv8s+BiFPN 0.76 0.673 0.752 0.448 10.24M 28.6
YOLOv8s+MHSA-C2f 0.769 0.705 0.764 0.470 11.33M 28.6
YOLOv8s+SPPF 0.790 0.738 0.772 0.486 11.13M 28.5
YOLOv8s+SPP 0.783 0.733 0.776 0.479 11.14M 28.5
YOLOv8s+SimSPPF 0.798 0.730 0.772 0.490 10.35M 27.6
YOLOv8s+SPPSCCAP 0.786 0.700 0.788 0.487 12.79M 29.8

In summary, in the wind turbine blade defect detection task, the SPPSCCAP module demonstrates outstanding performance due to its advanced multi-scale feature extraction and fusion capabilities.
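The receptive-field benefit of cascaded pooling, which SPPF-style modules (and, by extension, SPPSCCAP's multi-scale pooling layers) exploit, follows from simple arithmetic: n cascaded k×k, stride-1 pooling layers have an effective kernel of 1 + n*(k - 1). A quick check (illustrative only; it does not reproduce the module's internals):

```python
def effective_kernel(k, n):
    """Effective receptive field of n cascaded k x k, stride-1 pooling layers."""
    return 1 + n * (k - 1)

# Three serial 5x5 max pools match SPP's parallel 5/9/13 kernels, which is
# why serial pooling widens the receptive field at lower cost.
kernels = [effective_kernel(5, n) for n in (1, 2, 3)]
```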

Ablation experiment

To further validate the effectiveness of the proposed SASED-YOLO model, we conducted systematic ablation experiments on the WTBD818-DET dataset to assess the contribution of each module to the model’s overall performance. In these experiments, the original YOLOv8s served as the benchmark model, with various improvements and optimizations applied to its backbone network structure. According to the ablation experiment results in Table 7, after introducing the designed collaborative attention mechanism (CADP-SCSA) module (A), the model showed significant improvements across all performance indicators. Specifically, the precision (P) increased by 3.5%, the recall (R) by 1.6%, mAP@0.5 by 5.2%, and mAP@0.5:0.95 by 3.6%. These results confirm that the proposed CADP-SCSA module effectively detects multi-scale wind turbine blade defects and exhibits strong fine-grained information extraction capabilities.

Table 7.

Ablation experiments.

Model CADP-SCSA SPPSCCAP C2f-SENetv2 FADown Parameters P R mAP@0.5 mAP@0.5:0.95 GFLOPs
YOLOv8s - - - - 11.13M 0.790 0.738 0.772 0.486 28.5
A ✓ - - - 12.91M 0.825 0.754 0.824 0.522 29.9
B - ✓ - - 12.79M 0.786 0.700 0.788 0.487 29.8
C - - ✓ - 11.18M 0.779 0.666 0.762 0.469 28.5
D - - - ✓ 9.97M 0.767 0.768 0.784 0.503 25.7
E ✓ ✓ - - 14.58M 0.824 0.800 0.842 0.584 31.2
F ✓ ✓ ✓ - 14.63M 0.825 0.798 0.839 0.551 31.2
SASED-YOLO ✓ ✓ ✓ ✓ 13.50M 0.852 0.834 0.877 0.599 28.5

Building on module (A), we optimized the original network’s SPPF module (E) by incorporating the proposed SPPSCCAP module. This integration further enhanced the model’s performance compared to the original YOLOv8s. Specifically, the improved model’s precision (P) increased by 3.4%, recall (R) by 6.2%, mAP@0.5 by 7%, and mAP@0.5:0.95 by 9.8%. These experimental results confirm the effectiveness of the multi-scale feature pooling module (SPPSCCAP) presented in this study. By expanding the receptive field, the SPPSCCAP module significantly enhances the fine-grained defect feature extraction and fusion capabilities for wind turbine blades.

Finally, based on module (E), we integrated the C2f-SENetv2 module and the FADown module to form the complete SASED-YOLO model. Compared to the original YOLOv8s, the SASED-YOLO model demonstrated substantial improvements: precision (P) increased by 6.2%, recall (R) by 9.6%, mAP@0.5 by 10.5%, and mAP@0.5:0.95 by 11.3%. These performance indicators fully demonstrate that the SASED-YOLO model proposed in this article exhibits excellent accuracy and robustness in the task of detecting defects in wind turbine blades.

Comparative experiments with mainstream algorithms

In the wind turbine blade defect detection task, to verify the superiority of the proposed SASED-YOLO algorithm, three representative categories of object detection models—RT-DETR, Mamba, and the YOLO family—were selected for comparative experiments based on the WTBD818-DET dataset. To ensure fair comparisons, all experiments were conducted under consistent settings. Specifically, for validation testing, FPS and latency were measured using a batch size of 1, considering only the preprocessing and inference stages and excluding post-processing. This approach allows the measurements to accurately reflect the intrinsic computational efficiency of each model. All validation measurements were performed on an RTX 4060 Ti GPU with a fixed input resolution of 640×640.
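The batch-size-1 timing protocol above can be sketched as follows (a simplified stand-in: the callable, warm-up count, and dummy inputs are hypothetical, and on a real GPU the device must be synchronized before each clock read):

```python
import time

def measure_latency(infer, inputs, warmup=10):
    """Average per-image latency and FPS for a batch-size-1 inference callable.
    Post-processing is deliberately excluded, matching the protocol in the text."""
    for x in inputs[:warmup]:            # warm-up runs are not timed
        infer(x)
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    latency = elapsed / len(inputs)      # seconds per image
    return latency, 1.0 / latency        # FPS = TF / ET

dummy_infer = lambda x: x                # stands in for preprocessing + forward pass
latency, fps = measure_latency(dummy_infer, list(range(100)))
```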

As presented in Table 8, SASED-YOLO demonstrates a remarkable advantage in detection accuracy compared with the latest YOLO, RT-DETR, and Mamba algorithms. Compared with YOLOv5s, YOLOv6s, YOLOv8s, YOLOv10s, YOLOv11s, YOLOv12s, and YOLOv13s of the same model scale, the proposed algorithm achieves substantial improvements across four key evaluation metrics: precision (P), recall (R), mAP@0.5, and mAP@0.5–0.95. Specifically, relative to YOLOv8s, SASED-YOLO improves precision by 6.2%, recall by 9.6%, mAP@0.5 by 10.5%, and mAP@0.5–0.95 by 11.3%.

Table 8.

Comparison of different algorithms on the WTBD818-DET.

Method P R mAP@0.5 mAP@0.5:0.95 Latency FPS Parameters GFLOPs
YOLOv3-tiny 0.750 0.609 0.657 0.372 3.3ms 300.49 8.7M 12.9
YOLOv5s 0.735 0.739 0.77 0.475 5ms 200 9.11M 23.8
YOLOv6n 0.678 0.638 0.676 0.406 3.93ms 254.50 4.7M 11.4
YOLOv6s 0.748 0.717 0.732 0.456 5.75ms 173.71 18.5M 45.3
YOLOv8n 0.70 0.607 0.713 0.422 4.03ms 248.35 3.01M 8.2
YOLOv8s 0.790 0.738 0.772 0.486 5.01ms 199.48 11.13M 28.6
YOLOv10s 0.794 0.675 0.743 0.466 5.74ms 174.18 8.07M 24.8
YOLOv11s 0.801 0.791 0.83 0.530 5.31ms 188.25 9.42M 21.3
YOLOv12s 0.765 0.658 0.756 0.439 8.48ms 117.91 9.08M 19.3
YOLOv13s 0.884 0.714 0.808 0.529 11.25ms 88.84 9.00M 20.8
RT-DETR-18 0.768 0.628 0.687 0.401 11.86ms 84.34 19.88M 57.0
RT-DETR-34 0.737 0.688 0.705 0.389 15.54ms 64.33 31.11M 88.8
Mamba-B 0.810 0.668 0.765 0.479 33.30ms 30.03 21.83M 49.7
SASED-YOLO 0.852 0.834 0.877 0.599 14.05ms 72.08 13.5M 28.6

Furthermore, compared with RT-DETR-18 and Mamba-B, the proposed SASED-YOLO algorithm exhibits lower computational complexity while achieving significantly higher detection accuracy. SASED-YOLO attains 87.7% in mAP@0.5 and 59.9% in mAP@0.5–0.95. Relative to the latest YOLOv13s model, these values represent improvements of 6.9% in mAP@0.5 and 7.0% in mAP@0.5–0.95. In addition, the algorithm achieves a precision (P) of 85.2% and a recall (R) of 83.4%, substantially surpassing all other compared models.

The scatter plot presented in Fig. 13 illustrates the performance of different algorithms in terms of GFLOPs and mAP@0.5. As shown in the figure, the proposed SASED-YOLO algorithm achieves substantially higher detection accuracy under comparable computational complexity. Specifically, compared with algorithms such as YOLOv10s, YOLOv11s, YOLOv12s, and YOLOv13s, SASED-YOLO attains a significant improvement in mAP@0.5 with only a marginal increase in GFLOPs, which strongly demonstrates the superior detection capability of the proposed method.

Fig. 13.


mAP@0.5 and GFLOPs values for different models.

Although the SASED-YOLO model achieves remarkable improvements in detection accuracy, its inference speed is reduced compared with most real-time models. Specifically, relative to the latest YOLOv13s, SASED-YOLO introduces an additional 4.5M parameters and increases FLOPs by 7.8G, while its FPS decreases by 16.76, indicating inferior real-time performance compared with other models of similar scale. This degradation primarily stems from the increased network depth, which rises from 225 layers in YOLOv8s to 468 layers in SASED-YOLO, as well as the integration of channel attention, spatial feature reorganization, and channel-dimension concatenation operations. These structural designs introduce a large number of small operators within the network, resulting in more frequent GPU kernel scheduling and intensified data exchange between operators. Consequently, the memory read/write overhead increases, the GPU processing speed decreases, and the actual per-frame forward inference time becomes significantly longer, ultimately affecting the overall real-time performance of the model.

Comparison of actual detection effects of different algorithms

This study conducts a comprehensive performance comparison between the YOLOv8s and SASED-YOLO models under the same experimental conditions described in Section 3.3. The performance results are shown in the P-R curves in Figs. 14 and 15. These figures illustrate the trends in precision (P) and recall (R) for defect detection across different categories of wind turbine blades. Further quantitative analysis reveals that the improved SASED-YOLO algorithm increases mAP@0.5 from 77.2% to 87.7%, a significant improvement of 10.5%. Additionally, for the fine-grained defect types targeted in this study (such as surface_eye, injury, crack, surface_oil, Lightning-Strike, surface_attach, and surface_corrosion), the SASED-YOLO model demonstrates a significant improvement in the AP index compared to the YOLOv8s model. The AP scores for these defect types increased from 83.4%, 60.1%, 69.3%, 59.4%, 85.4%, 78.2%, and 82.2% to 96.3%, 71.4%, 82.4%, 72.6%, 98.4%, 89.4%, and 91.4%, respectively. These results robustly demonstrate the effectiveness of the improved strategy. Specifically, in detecting fine-grained defects in wind turbine blades under multi-scale and complex background conditions, the SASED-YOLO model exhibits significant performance advantages.

Fig. 14.


P-R curve of the YOLOv8s algorithm.

Fig. 15.


P-R curve of the SASED-YOLO algorithm.

In order to provide a more intuitive comparison of detection performance before and after model improvements, this study selects 8 sets of images from different defect categories in the WTBD818-DET test set. These images were input into the YOLOv8s, YOLOv11s, RTDETR-34, and SASED-YOLO models for defect detection. The results of this comparison are shown in Fig. 16.

Fig. 16.


Comparison of detection effects of four algorithms.

For the detection of surface_eye defects, both YOLOv8s and YOLOv11s show incomplete detection. Although RTDETR-34 detects all defects, its accuracy is lower than that of SASED-YOLO. In the case of injury defects, only SASED-YOLO successfully detects all instances, while the other three models miss some defects. When detecting cracks, YOLOv8s and YOLOv11s produce incomplete results, with lower detection accuracy compared to SASED-YOLO. For surface_oil defects, SASED-YOLO significantly outperforms the other models in terms of detection accuracy, while RTDETR-34 exhibits notable false positives (indicated by yellow arrows). In the detection of craze and Lightning-Strike defects, SASED-YOLO outperforms both RTDETR-34 and YOLOv11s in accuracy. For surface_attach defects, SASED-YOLO demonstrates much higher detection accuracy than the other three models. Lastly, for surface_corrosion defects, RTDETR-34 shows false detections (marked with orange arrows), while SASED-YOLO achieves higher accuracy than YOLOv8s and YOLOv11s, with more precise defect identification.

Overall, SASED-YOLO demonstrates superior detection accuracy, as well as more precise target localization and recognition capabilities in multi-scale wind turbine blade defect detection tasks.

Comparative analysis of algorithm performance for weld defect detection

To systematically assess the efficacy and cross-domain generalization of the SASED-YOLO model, experiments were further conducted on a welding defect detection dataset characterized by substantial illumination variations, multi-scale defects, and complex backgrounds. This dataset, obtained from the publicly available Kaggle welding defect dataset, provides a reliable benchmark for evaluating the model’s adaptability across domains. For comparative analysis, representative object detection models, including YOLOv11s, YOLOv12s, YOLOv13s, and RT-DETR, were selected. All experiments were conducted under identical training and evaluation settings, strictly adhering to the experimental parameters and metrics outlined in Section 4.3, thereby ensuring a fair and controlled comparison. The results of these experiments are summarized in Table 9.

Table 9.

Comparison experiments on the Weld Defect dataset.

Method P R mAP@0.5 mAP@0.5:0.95 Latency FPS Parameters GFLOPs
YOLOv5s 0.768 0.709 0.747 0.518 4.82ms 207.40 9.11M 23.8
YOLOv8s 0.773 0.712 0.748 0.531 5.07ms 197.07 11.13M 28.6
YOLOv10s 0.720 0.708 0.722 0.49 6.00ms 166.63 8.07M 24.8
YOLOv11s 0.789 0.69 0.748 0.528 5.41ms 184.70 9.42M 21.3
YOLOv12s 0.763 0.697 0.734 0.512 8.08ms 123.83 9.08M 19.3
YOLOv13s 0.767 0.72 0.729 0.520 11.13ms 89.85 9.00M 20.8
RTDETR-18 0.765 0.71 0.684 0.467 12.10ms 82.59 19.88M 57.0
RTDETR-34 0.776 0.688 0.675 0.459 14.43ms 69.28 31.11M 88.8
SASED-YOLO 0.793 0.726 0.768 0.553 14.27ms 70.07 13.49M 28.6

In this cross-domain detection task, the proposed SASED-YOLO model achieved improvements of 2.0% and 2.2% over the original YOLOv8s in terms of mAP@0.5 and mAP@0.5:0.95, respectively. Compared with the currently top-performing YOLOv13s, SASED-YOLO demonstrates a 3.9% increase in mAP@0.5 and a further 3.3% improvement in mAP@0.5:0.95. These results indicate that SASED-YOLO exhibits strong generalization capability and robust detection performance in cross-domain welding defect detection, thereby validating the effectiveness of the proposed approach in complex scenarios.

To more intuitively demonstrate the performance advantages of the proposed SASED-YOLO model in the welding defect detection task, several sample images were randomly selected from the validation set for detection visualization. The comparison models include YOLOv8s, YOLOv10s, YOLOv11s, YOLOv12s, YOLOv13s, as well as the proposed SASED-YOLO. All models were evaluated under identical inference settings, and the visualization results are presented in Fig. 17.

Fig. 17.


Comparison of detection effects of six algorithms.

From Sample 1, it is evident that YOLOv8s and YOLOv11s produce redundant bounding boxes, while YOLOv10s suffers from missed detections. Sample 2 shows that YOLOv8s and YOLOv13s fail to detect certain defects, and YOLOv12s yields incorrect predictions. In Sample 3, all comparison models except the proposed SASED-YOLO exhibit misclassification issues; YOLOv10s completely fails to detect the target, YOLOv8s generates false positives, and YOLOv12s and YOLOv13s produce redundant bounding boxes. In Sample 5, YOLOv8s, YOLOv11s, YOLOv12s, and YOLOv13s all show missed detections, while YOLOv8s and YOLOv11s additionally produce false positives. In Sample 6, only the proposed SASED-YOLO successfully identifies the defect location.

In summary, the visualization results demonstrate that the proposed SASED-YOLO algorithm can more accurately identify defect targets in welding inspection tasks, effectively reducing missed detections, false positives, and redundant bounding boxes. Compared with other mainstream models, SASED-YOLO exhibits a clear advantage in detection performance.

Conclusion

This paper introduces the SASED-YOLO target detection algorithm, which is designed to efficiently and accurately detect and identify multi-scale, fine-grained defects on wind turbine blade surfaces. First, by embedding the CADP-SCSA attention mechanism into the backbone network, the algorithm’s ability to extract multi-scale features is enhanced, while background interference is reduced. Second, the introduction of the SPPSCCAP module expands the receptive field, thereby enhancing the detection of multi-scale defects and facilitating the fusion of fine-grained features. Next, the C2f-SENetv2 module is integrated to enhance the algorithm’s ability to represent features across different channels. This integration highlights key features, suppresses redundant information, and significantly reduces errors in defect identification. Finally, the FADown module is designed to reduce the computational complexity of the model, thereby enabling shallow-layer features to retain more contextual, fine-grained information. Experimental results indicate that, although the number of parameters in the SASED-YOLO algorithm increases slightly, the mAP@0.5 value improves by 10.5%. This significant increase in detection accuracy demonstrates the model’s effectiveness in detecting surface defects on wind turbine blades. The model achieves high performance in accurately identifying multi-scale and multi-category defect targets.

Cross-dataset evaluation on the publicly available welding defect dataset demonstrates the strong generalization capability of SASED-YOLO, which outperforms the original YOLOv8s in mAP@0.5 and mAP@0.5:0.95, and further surpasses the state-of-the-art YOLOv13s, indicating robustness and effectiveness in cross-domain scenarios.
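For context, mAP@0.5 counts a prediction as correct when its intersection-over-union (IoU) with a ground-truth box reaches 0.5, while mAP@0.5:0.95 averages the metric over IoU thresholds from 0.5 to 0.95 in steps of 0.05. A minimal sketch of the IoU criterion (box format and names are illustrative, not tied to the paper’s evaluation code):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# mAP@0.5:0.95 averages the match criterion over these ten IoU thresholds
thresholds = [0.5 + 0.05 * i for i in range(10)]

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```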

However, the SASED-YOLO algorithm exhibits certain limitations in real-time performance, as its single-frame inference speed is noticeably lower than that of other real-time models of the same scale. Future work will focus on optimizing the network architecture and computational efficiency, including techniques such as model pruning and knowledge distillation, to further improve real-time performance while maintaining high detection accuracy.
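Knowledge distillation, mentioned above as future work, typically trains a compact student network to match a larger teacher’s softened output distribution. A minimal sketch of the standard soft-label KL loss — the temperature T and the T² scaling follow the common formulation; the names and logits are illustrative, not drawn from the paper:

```python
import math

def softmax(logits, t):
    """Temperature-softened softmax over a list of logits."""
    m = max(v / t for v in logits)
    exps = [math.exp(v / t - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, t=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by t^2."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    kl = sum(pi * (math.log(pi + 1e-12) - math.log(qi + 1e-12))
             for pi, qi in zip(p, q))
    return t * t * kl

# The loss vanishes when the student matches the teacher exactly
print(distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))  # 0.0
```

Raising the temperature spreads probability mass over non-maximal classes, exposing the teacher’s inter-class similarity structure to the student.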

Author contributions

R.W. was responsible for the conceptual design, funding acquisition, supervision, and validation of the project. F.L. was responsible for collecting and labeling the experimental datasets, constructing the proposed algorithms, conducting the experiments, and writing the manuscript. Y.W., F.Z., X.B., and N.G. validated the article data.

Funding

This work was supported in part by the Jiangsu Graduate Practical Innovation Project under Grant SJCX24_2152, in part by the National Natural Science Foundation of China under Grant 62301473, and in part by the Major Project of Natural Science Research of Jiangsu Province Colleges and Universities under Grant 19KJA110002.

Data availability

The WTBD818-DET dataset, along with the corresponding source code and trained model weights, is available from the corresponding author (Rugang Wang, email: wrg3506@ycit.edu.cn) upon reasonable request. In addition, the welding defect detection dataset used for comparative experiments is publicly available at https://www.kaggle.com/datasets/sukmaadhiwijaya/welding-defect-object-detection/data.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



