Scientific Reports. 2026 Mar 2;16:11798. doi: 10.1038/s41598-026-41557-5

A cascaded group attention mechanism-based object detection algorithm for construction and demolition waste

Zeping Jiang 1,#, Ying Yang 1,2,✉,#, Jiayi Hu 1, Chuyan Yuan 1
PMCID: PMC13066445  PMID: 41771958

Abstract

Accurate object detection is crucial for managing construction and demolition waste (CDW). However, existing deep-learning models often exhibit limited performance in detecting small objects within complex environments. This study proposes a YOLOv11-based detection algorithm integrated with a novel Cascaded Group Attention (CGA) mechanism to enhance the model’s ability to capture fine-grained features while significantly reducing computational and memory costs. First, we propose a transformer backbone based on CGA to improve long-range dependency modeling while substantially reducing redundant computations. Second, we employ a bidirectional multi-scale interaction module in the neck to integrate fine-grained details from high-resolution features with semantic context from low-resolution features, enabling accurate detection of CDW objects across scales. Finally, the proposed method is evaluated on two datasets. For comparison, we have reproduced several similar YOLOv11-based algorithms to validate the effectiveness of our approach. The results demonstrate a clear advantage of our approach, achieving mAP scores of 0.938 and 0.962, respectively, thereby surpassing the current state-of-the-art methods. Additionally, visualization of prediction results on test samples further confirms the high accuracy of our model. The data and code for this research can be obtained at the following link: https://github.com/jzp-csust/CDW.

Subject terms: Engineering, Mathematics and computing

Introduction

With the accelerating pace of urbanization, the generation of construction and demolition waste (CDW) has increased dramatically, now comprising approximately 40% of total municipal solid waste1,2. CDW contains recyclable materials such as concrete, bricks, and wood, which can be recovered and reused to manufacture new construction products3–5. However, current recycling practices are hampered by resource-utilization inefficiencies that stem from reliance on manual sorting, a process that is inherently time-consuming and costly6–8. To address this challenge, accurate and efficient CDW detection algorithms are essential to improve the safety and efficiency of CDW sorting and to enhance the recovery of high-quality materials9,10.

In resource-constrained field environments, lightweight convolutional neural networks have become a research hotspot due to their balance between accuracy and efficiency. Research in this field has expanded from traditional CNNs to Transformers and their hybrid architectures11–13. By designing lightweight backbone networks and incorporating techniques such as attention modules, these models have achieved both high accuracy and real-time performance in tasks like aerial inspection and industrial meter reading14–17. Researchers have recently combined visual imaging with deep learning18–20, opening new avenues for CDW detection21,22. Ku et al.23 first proposed an RCNN-based algorithm for waste sorting that automatically extracts convolutional features from images and uses supervised learning to train the network, achieving a 20% improvement in accuracy over traditional methods. Demetriou et al.5 conducted the first evaluation on a public CDW dataset and found that deeper backbone architectures yielded negligible performance gains in current models, suggesting that single-stage detectors can offer computational-efficiency advantages over two-stage detectors.

Among object detection algorithms, YOLO24–27 and SSD (Single Shot Detector)28 are two classic and widely applied single-stage detection models. The YOLO algorithm was introduced by Redmon et al.29 in 2016, and researchers have continued to enhance its detection accuracy and operational efficiency through successive model improvements30. Empirical evidence supports its suitability for CDW detection: Demetriou et al.5 reported that YOLOv7 achieved the highest accuracy (0.717) on a comparable CDW detection task, and the YOLO series outperformed SSD series algorithms on the same dataset. Based on these considerations, we select the YOLO framework for experimental validation of CDW detection in this study.

With the continuous evolution of the YOLO series algorithms, their application in CDW detection has seen further advancements. Fang et al.7 enhanced the feature extraction module and detection head of YOLOv7 by introducing the FB-Concat module and Conv-2 convolutional blocks, achieving a 2.8% improvement in detection accuracy and a 1.05× increase in detection speed. Similarly, Song et al.31 optimized the detection head with DSConv modules, boosting the model's mAP to 90.7%. Both studies focused on refining the convolutional kernels in YOLOv7's detection head, indicating computational efficiency bottlenecks when processing complex CDW image features. In contrast, the YOLOv8 model, introduced by the Ultralytics team, achieved more significant breakthroughs through architectural innovations: its decoupled detection head improves feature disentanglement, while the C2f module (Cross-Stage Feature Fusion) enhances multi-scale feature extraction efficiency. These advancements established YOLOv8 as the then-recognized state-of-the-art model in the YOLO series32,33. Subsequent work has shown that YOLOv8 and its improved variants outperform YOLOv7 and its enhanced versions in both accuracy and computational efficiency34–36. Yang et al.2 highlighted that CDW detection scenarios are complex and often involve severe occlusion, necessitating reinforced feature extraction in YOLOv8 to minimize redundant data acquisition and processing. Their work integrated the ECA (Efficient Channel Attention) mechanism into the bottleneck layer, which enhanced detection accuracy by 3% while reducing the parameter count by 12%. Additionally, they optimized the C2f module using FasterNet, improving the model's overall inference speed. However, the combined use of ECANet and FasterNet increased the architectural complexity of YOLOv8, leading to higher computational resource consumption, greater memory usage, and elevated hardware requirements (Table 1).

Table 1.

Comparison of CDW detection methods.

Method Year Key features Limitations
Ku et al.23 2021 RCNN, AE Restricted efficiency
Lu et al.22 2022 DeepLabv3+ Limited accuracy
Demetriou et al.5 2023 YOLOv7 Restricted efficiency
Fang et al.7 2024 YOLOv7 Restricted efficiency
Ranjbar et al.21 2025 MobileNetV3, Swin Transformer Limited accuracy

To further enhance the detection performance of the YOLO series models, Ultralytics released YOLOv11 in September 2024 as an improved successor to YOLOv8, introducing several key innovations. First, it replaces YOLOv8's C2f module with a more computationally efficient C3k2 module, significantly improving processing speed. Second, to better detect complex or partially occluded objects, it incorporates a C2PSA module (combining CSPNet with spatial attention mechanisms), which strengthens the model's focus on critical target regions. Additionally, the upgraded SPPF (Fast Spatial Pyramid Pooling) module enhances multi-scale feature extraction, substantially improving detection performance for objects of varying sizes and orientations, particularly excelling in small and overlapping object detection37–39.

Although existing object detection models perform well on natural images, they still face significant limitations when applied directly to CDW detection. On one hand, different materials in CDW, such as concrete and ceramics, often have similar textures and colors, making them difficult for models to distinguish; general attention mechanisms often fail to capture these subtle material differences. On the other hand, the scale of on-site debris varies dramatically, from large fragments in the foreground to small, distant accumulations, and existing multi-scale detection methods struggle to handle such irregular variations, frequently leading to missed detections of small targets or inaccurate localization of large ones. Building upon the advances introduced in YOLOv11, we propose a novel framework for CDW detection. The proposed approach integrates an efficient transformer-based backbone with a cascaded group attention mechanism and a cross-scale interaction neck, enabling precise detection of CDW objects under complex backgrounds, severe occlusions, and diverse size distributions. Cascaded group attention enables interaction and comparison among feature groups corresponding to different material categories, thereby highlighting discriminative inter-category features in the global context and suppressing common background interference. This "local focus first, global refinement later" cascaded approach allows the model to capture subtle differences between materials more finely, effectively mitigating misdetection of visually similar materials such as concrete and ceramics. The main contributions of this work are summarized as follows:

  • We design an architecture that first performs cross-scale feature interaction to enhance semantic and spatial representations, followed by scale-specific detection heads for different object sizes.

  • We introduce a memory- and computation-efficient cascaded group attention mechanism within a transformer-based backbone to improve long-range dependency modeling while suppressing redundant computations.

  • We employ a bidirectional multi-scale interaction module in the neck to integrate fine-grained details from high-resolution features with semantic context from low-resolution features, enabling accurate detection of CDW objects across a wide range of scales.

  • The proposed object detection algorithm outperforms state-of-the-art (SOTA) methods on both the BTC and SWP datasets.

Method

Overview

The proposed CDW detection algorithm is a single-stage object detection framework designed to address challenges such as complex backgrounds, severe occlusion, and diverse object scales in CDW scenes. As shown in Fig. 1, the architecture comprises three main components: a lightweight attention-enhanced backbone, a multi-scale feature fusion neck with spatial attention, and a decoupled detection head. The backbone employs an Overlap Patch Embedding strategy followed by hierarchical CGA Blocks to efficiently capture local and global contextual information. These features are then refined and fused in the neck using C3k2 modules and a C2PSA block to strengthen spatial feature representation across multiple scales. The detection head outputs category probabilities and bounding box coordinates, enabling accurate localization and classification of CDW objects.

Fig. 1. Overall architecture of the proposed CDW detection algorithm. The network consists of four main components: (a) backbone with overlap patch embedding and hierarchical transformer stages; (b) transformer block with cascaded group attention; (c) cascaded group attention mechanism; (d) neck and detection head with C2PSA and C3k2 modules for multi-scale detection.

The algorithm takes RGB images of CDW scenes as input, represented as tensors of size $H \times W \times 3$, where $H$ and $W$ are the image height and width, and the three channels correspond to the red, green, and blue components. The output comprises a set of detected bounding boxes, each associated with a class label and a confidence score, effectively identifying objects across varying scales and occlusion conditions.

Cascaded group attention based backbone

The original lightweight backbone network of YOLOv11 is primarily based on traditional CNN architectures, which rely mainly on local convolutional operations. The inherent local receptive field of CNNs makes it difficult to efficiently model long-range dependencies in images, leading to limitations in tasks involving occluded objects, dense scenes, and scenarios requiring global contextual understanding. In contrast, Transformer models achieve global context modeling through self-attention mechanisms, enabling more effective capture of semantic relationships among different regions of an image. Although the YOLOv11 backbone is already lightweight, introducing an optimized lightweight Transformer variant significantly enhances the model's perception capability in complex scenes while maintaining comparable real-time inference speed.

The backbone network begins with an overlap patch embedding operation that transforms the input image $I \in \mathbb{R}^{H \times W \times 3}$ into a sequence of patch tokens while retaining local spatial continuity. The specific architecture of this backbone is illustrated in Fig. 1a and b. Given a convolutional projection $\operatorname{Conv}_{k,s}$ with kernel size $k$ and stride $s < k$ to ensure overlap, the embedding process can be formulated as:

$$X_0 = \operatorname{Conv}_{k,s}(I) \in \mathbb{R}^{H_0 \times W_0 \times C_0} \qquad (1)$$

where $H_0$ and $W_0$ denote the spatial dimensions after embedding, and $C_0$ is the initial channel dimension. The overlapping design mitigates boundary artifacts and preserves fine-grained object details essential for small-scale CDW instance recognition.
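To make the overlapping embedding concrete, the following is a minimal PyTorch sketch rather than the released implementation; the kernel size 7, stride 4, and 64 output channels are illustrative assumptions consistent with Eq. (1).

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Sketch of overlap patch embedding (Eq. 1): a strided convolution whose
    kernel is larger than its stride, so adjacent patches share pixels.
    Kernel 7, stride 4, and embed_dim 64 are illustrative assumptions."""
    def __init__(self, in_ch=3, embed_dim=64, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=kernel,
                              stride=stride, padding=kernel // 2)

    def forward(self, img):          # img: (B, 3, H, W)
        return self.proj(img)        # (B, C0, H0, W0), overlapping patches

# usage: OverlapPatchEmbed()(torch.randn(1, 3, 640, 640)).shape -> (1, 64, 160, 160)
```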

Following patch embedding, the backbone adopts a three-stage hierarchical transformer architecture, where each stage consists of $L_i$ transformer blocks followed by a subsampling operation that progressively reduces the spatial resolution and increases the channel depth. Let $\Phi_i$ denote the transformer block stack in stage $i$, and $\mathcal{D}_i$ denote the subsampling operation (implemented as a convolutional downsampling layer); the hierarchical feature extraction can be expressed as:

$$X_i = \mathcal{D}_i\big(\Phi_i(X_{i-1})\big) \qquad (2)$$

for $i = 1, 2$, while the final stage output $X_3 = \Phi_3(X_2)$ serves as the highest-level semantic representation. This design enables the backbone to capture both high-resolution local details and low-resolution global context, forming a strong foundation for multi-scale object detection in complex CDW environments.
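A minimal sketch of the three-stage hierarchical composition in Eq. (2) is given below; plain convolutions stand in for the transformer blocks, and the stage depths and channel widths are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def make_stage(dim_in, dim_out, num_blocks):
    """One hierarchical stage (Eq. 2): a stack of blocks (convolutional
    stand-ins for the transformer blocks) followed by a strided-conv
    subsampling layer that halves resolution and expands channels."""
    blocks = [nn.Conv2d(dim_in, dim_in, 3, padding=1) for _ in range(num_blocks)]
    subsample = nn.Conv2d(dim_in, dim_out, 3, stride=2, padding=1)
    return nn.Sequential(*blocks, subsample)

backbone = nn.Sequential(
    make_stage(64, 128, num_blocks=2),    # stage 1
    make_stage(128, 256, num_blocks=2),   # stage 2
    make_stage(256, 512, num_blocks=3),   # stage 3 (highest-level semantics)
)
# usage: backbone(torch.randn(1, 64, 160, 160)).shape -> (1, 512, 20, 20)
```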

Transformer layer

Each stage in the backbone is composed of multiple transformer layers, where each layer follows an EfficientViT-inspired sandwich layout that balances spatial dependency modeling and computational efficiency. Specifically, given the input feature $X_l$ of the $l$-th layer, a memory-efficient self-attention module $\Phi^{A}$ is placed between two feed-forward network (FFN) modules $\Phi^{F}$, yielding:

$$X_{l+1} = \Phi^{F}\Big(\Phi^{A}\big(\Phi^{F}(X_l)\big)\Big) \qquad (3)$$

where each $\Phi^{F}$ consists of a depthwise convolution-based token interaction layer followed by a channel-mixing MLP, enabling localized structural information to be embedded before and after spatial mixing. The attention module $\Phi^{A}$ adopts a cascaded group attention mechanism that reduces redundancy by computing self-attention within grouped channel subspaces and progressively refining features across groups. This design minimizes the memory and computation cost of full attention, while retaining the ability to capture long-range dependencies and enhancing discriminative feature representation for CDW objects.
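The following sketch illustrates one plausible form of the FFN module $\Phi^{F}$ described above, assuming NCHW feature maps; the expansion ratio and activation are assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class TokenMixFFN(nn.Module):
    """Sketch of an FFN for the sandwich layout (Eq. 3): a depthwise 3x3
    convolution for local token interaction followed by a 1x1 channel-mixing
    MLP, each with a residual connection. Expansion ratio is an assumption."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * expansion, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim * expansion, dim, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.dwconv(x)   # local token interaction
        return x + self.mlp(x)   # channel mixing

# usage: TokenMixFFN(64)(torch.randn(1, 64, 80, 80))
```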

Cascaded group attention

To efficiently model long-range dependencies while reducing computational redundancy, the self-attention module $\Phi^{A}$ adopts a cascaded group attention (CGA) mechanism, whose architecture is illustrated in Fig. 1b and c. Given the input feature $X_i$ output by the first FFN $\Phi^{F}$, the channel dimension is evenly split into $h$ groups, producing sub-features $X_{ij}$, where $X_{ij}$ denotes the $j$-th group of the input to the $i$-th attention layer. For each group, the self-attention operation is computed as:

$$\tilde{X}_{ij} = \operatorname{Softmax}\!\left(\frac{(X_{ij}W^{Q}_{ij})(X_{ij}W^{K}_{ij})^{\top}}{\sqrt{d}}\right) X_{ij}W^{V}_{ij} \qquad (4)$$

where $W^{Q}_{ij}$, $W^{K}_{ij}$, and $W^{V}_{ij}$ are learnable projection matrices mapping the input to the query, key, and value spaces of dimension $d$. The outputs of all groups are concatenated and linearly projected to match the input dimension:

$$\tilde{X}_{i} = \operatorname{Concat}\big(\tilde{X}_{i1}, \ldots, \tilde{X}_{ih}\big)\, W^{P}_{i} \qquad (5)$$

where $W^{P}_{i}$ is a learnable projection matrix.

In the cascaded refinement process, each group (for $j > 1$) is augmented with the refined output of the preceding group before its attention is computed, enabling progressive information propagation across groups:

$$X'_{ij} = X_{ij} + \tilde{X}_{i(j-1)}, \quad 1 < j \le h \qquad (6)$$

with $X'_{i1} = X_{i1}$. This progressive interaction enriches each group's representation with contextual information from previously computed groups, improving feature discrimination with minimal additional computation. By combining grouped computation and cascaded refinement, CGA effectively captures global dependencies while maintaining a lightweight attention structure suitable for real-time CDW detection.
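For clarity, a minimal PyTorch sketch of the cascaded group attention in Eqs. (4)-(6) is given below; it assumes token-format input and a single fused QKV projection per group, which simplifies the released design.

```python
import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Sketch of cascaded group attention (Eqs. 4-6): the channel dimension is
    split into `heads` groups, each group attends in its own subspace, and the
    attention output of each group is added to the next group's input
    (cascaded refinement) before all groups are concatenated and projected."""
    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.gd = heads, dim // heads
        self.qkv = nn.ModuleList(
            [nn.Linear(self.gd, 3 * self.gd) for _ in range(heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, N, C) token features
        groups = x.chunk(self.heads, dim=-1)   # h groups of size C/h
        outs, carry = [], 0
        for j, g in enumerate(groups):
            g = g + carry if j > 0 else g      # Eq. 6: add previous refined group
            q, k, v = self.qkv[j](g).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) / (self.gd ** 0.5)
            out = attn.softmax(dim=-1) @ v     # Eq. 4: per-group self-attention
            outs.append(out)
            carry = out
        return self.proj(torch.cat(outs, dim=-1))  # Eq. 5: concat + projection

# usage: CascadedGroupAttention(64, heads=4)(torch.randn(2, 400, 64))
```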

Multi-scale interaction-based detector

The detection head of the proposed CDW detection algorithm is designed to fully exploit multi-scale contextual information before performing object recognition at multiple resolutions. Its specific module architecture is illustrated in Fig. 1d. Let $\{P_s\}_{s=1}^{S}$ denote the set of multi-scale feature maps output by the backbone, where each $P_s$ corresponds to a specific spatial resolution. A multi-scale interaction module $\mathcal{M}$ first aggregates information across these feature maps through bidirectional feature fusion, enabling the high-resolution features to gain richer semantic context from low-resolution maps, while low-resolution features are enhanced with finer details from high-resolution maps:

$$\{\tilde{P}_s\}_{s=1}^{S} = \mathcal{M}\big(\{P_s\}_{s=1}^{S}\big) \qquad (7)$$

The fused features $\tilde{P}_s$ are then forwarded to a set of scale-specific detection heads $\mathcal{H}_s$, each responsible for predicting bounding boxes and class probabilities for objects whose sizes are most effectively captured at the corresponding resolution:

$$(\hat{B}_s, \hat{C}_s) = \mathcal{H}_s(\tilde{P}_s), \quad s = 1, \ldots, S \qquad (8)$$

Finally, predictions from all scales are combined to produce the final detection results. This multi-scale interaction, followed by multi-resolution detection, ensures robust performance on CDW objects exhibiting significant variation in size, shape, and occlusion.

Multi-scale interaction

The multi-scale interaction module adopts a bidirectional feature fusion strategy, combining a top-down pathway that passes rich semantic information from low-resolution, high-level features to high-resolution maps, and a bottom-up pathway that injects fine-grained spatial details from high-resolution features into coarser feature maps. Given the multi-scale feature set $\{P_s\}_{s=1}^{S}$ from the backbone, the fusion process alternates between upsampling and downsampling operations, with feature concatenation at each resolution to integrate complementary information. This reciprocal exchange ensures that features at all scales are simultaneously enriched with semantic context and spatial detail, providing a more robust representation for object detection across varying sizes.

At the fusion stages, we employ the C2PSA module to enhance spatial feature representation. Let $F$ be the concatenated feature map at a given resolution. C2PSA first applies a CSPNet-style split-transform-merge operation, partitioning the input channels into two parts: one directly propagated to the output, and the other processed through a position-sensitive spatial attention subnetwork. The spatial attention mechanism computes an attention map $M(F)$ using both global average pooling and global max pooling along the channel dimension, followed by a convolution and a sigmoid activation:

$$M(F) = \sigma\Big(\operatorname{Conv}\big([\operatorname{AvgPool}(F);\, \operatorname{MaxPool}(F)]\big)\Big) \qquad (9)$$

$$F' = M(F) \odot F \qquad (10)$$

The attended feature $F'$ is obtained via element-wise multiplication and finally merged with the shortcut path to yield the C2PSA output. This design strengthens the network's focus on spatially informative regions, improving detection under occlusion and background clutter.
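A minimal sketch of the spatial attention gate in Eqs. (9)-(10) follows; the 7 × 7 convolution kernel is an assumption, and the CSP split/merge wrapper of C2PSA is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpatialAttentionGate(nn.Module):
    """Sketch of the spatial attention inside C2PSA (Eqs. 9-10): channel-wise
    average and max pooling, a convolution, and a sigmoid produce a per-pixel
    gate that reweights the feature map. Kernel size 7 is an assumption."""
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=kernel, padding=kernel // 2)

    def forward(self, f):                                   # f: (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)                   # global avg over channels
        mx, _ = f.max(dim=1, keepdim=True)                  # global max over channels
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. 9
        return f * gate                                     # Eq. 10

# usage: SpatialAttentionGate()(torch.randn(1, 128, 40, 40))
```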

The C3k2 module is used for efficient feature transformation during both top-down and bottom-up fusion steps. Given an input $X$, C3k2 adopts a cross-stage partial design that splits the input into two branches: one branch undergoes two successive convolutions for local feature extraction, while the other serves as a direct identity connection. Formally, splitting the input along the channel dimension into two equal parts $X = [X_a; X_b]$, the transformation branch computes:

$$Y_a = \operatorname{Conv}\big(\operatorname{Conv}(X_a)\big) \qquad (11)$$

while the shortcut branch directly propagates $X_b$. The two branches are then concatenated and fused through a $1 \times 1$ convolution to obtain the final output:

$$Y = \operatorname{Conv}_{1 \times 1}\big([Y_a;\, X_b]\big) \qquad (12)$$

This structure reduces computational cost by avoiding full transformation of all channels, while maintaining strong representational capacity for multi-scale feature integration.
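The cross-stage partial transform of Eqs. (11)-(12) can be sketched as follows; the 3 × 3 kernels and SiLU activations follow common CSP practice and are assumptions here, not the released configuration.

```python
import torch
import torch.nn as nn

class CSPSplitBlock(nn.Module):
    """Sketch of a C3k2-style cross-stage partial transform (Eqs. 11-12):
    half the channels pass through two convolutions, the other half is an
    identity shortcut, and a 1x1 convolution fuses the concatenation."""
    def __init__(self, dim):
        super().__init__()
        half = dim // 2
        self.transform = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
        )
        self.fuse = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                    # x: (B, dim, H, W)
        xa, xb = x.chunk(2, dim=1)           # channel split into two halves
        ya = self.transform(xa)              # Eq. 11: transformed branch
        return self.fuse(torch.cat([ya, xb], dim=1))  # Eq. 12: concat + 1x1 fuse

# usage: CSPSplitBlock(128)(torch.randn(1, 128, 40, 40))
```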

Multi-scale object detection

Following the cross-scale feature interaction, the fused feature maps Inline graphic are fed into scale-specific detection heads, each tailored to predict objects whose sizes align with the receptive field of that scale. Each detection head Inline graphic adopts a decoupled architecture with separate branches for classification, objectness estimation, and bounding box regression.

Formally, for the $s$-th scale feature map $\tilde{P}_s$, the classification branch $\phi^{cls}_s$ predicts a category probability tensor $\hat{C}_s \in \mathbb{R}^{H_s \times W_s \times K}$, where $K$ is the number of object classes:

$$\hat{C}_s = \sigma\big(\phi^{cls}_s(\tilde{P}_s)\big) \qquad (13)$$

with $\sigma$ denoting the sigmoid activation; each entry represents the probability that an object of the corresponding class is present at that spatial location. The regression branch $\phi^{reg}_s$ produces bounding box offsets $\hat{B}_s$:

$$\hat{B}_s = \phi^{reg}_s(\tilde{P}_s) \in \mathbb{R}^{H_s \times W_s \times 4R} \qquad (14)$$

where $R$ denotes the number of discrete bins used for the integral representation of the box coordinates (typically set to 16). Consequently, the regression map $\hat{B}_s$ encodes the probability distributions of the four spatial distances. For each grid cell, the bounding box is parameterized as $(l, t, r, b)$, which represent the distances from the current grid center to the left, top, right, and bottom boundaries of the target box, respectively. Instead of predicting deterministic values directly, the final distances are derived by calculating the expectation of the probability distributions over the range $[0, R-1]$. The transformation from the predicted spatial distances $(l, t, r, b)$ to the absolute bounding box coordinates $(x, y, w, h)$ is given by:

$$x = c^{(s)}_x + \frac{r - l}{2}, \quad y = c^{(s)}_y + \frac{b - t}{2}, \quad w = l + r, \quad h = t + b \qquad (15)$$

where $(c^{(s)}_x, c^{(s)}_y)$ denotes the coordinates of the center of the corresponding grid cell at scale $s$. Unlike anchor-based approaches, this formulation directly reconstructs the bounding box geometry without requiring pre-defined anchor boxes or logarithmic scale adjustments.
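The integral (expectation-based) box decoding of Eqs. (14)-(15) for a single grid cell can be sketched as follows; stride scaling is omitted for brevity, and the function name is illustrative.

```python
import torch

def decode_box(reg_logits, cx, cy, R=16):
    """Sketch of anchor-free box decoding (Eqs. 14-15): the 4*R regression
    logits of one grid cell are turned into expected distances (l, t, r, b)
    via a softmax-weighted sum over bins 0..R-1, then converted to (x, y, w, h)."""
    bins = torch.arange(R, dtype=torch.float32)
    dist = reg_logits.view(4, R).softmax(dim=-1)      # per-direction distributions
    l, t, r, b = (dist * bins).sum(dim=-1)            # expectations of l, t, r, b
    x = cx + (r - l) / 2                              # Eq. 15: box center and size
    y = cy + (b - t) / 2
    return x, y, l + r, t + b

# usage: decode_box(torch.randn(64), cx=10.5, cy=7.5)
```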

The predictions from all scales are aggregated to form the final detection set:

$$\hat{Y} = \bigcup_{s=1}^{S} \big\{(\hat{B}_s, \hat{C}_s)\big\} \qquad (16)$$

This multi-scale prediction strategy ensures that small objects are detected using high-resolution features, while larger objects are captured at lower resolutions, thereby improving detection robustness across the diverse object sizes, shapes, and occlusion conditions commonly encountered in CDW scenarios.

Loss function

The overall training objective of the proposed CDW detection algorithm combines the losses from classification, objectness estimation, and bounding box regression. For a given scale $s$, the detection head outputs the category probability tensor $\hat{C}_s$ and the bounding box offsets $\hat{B}_s$.

For each positive sample (grid cell containing an object), the classification loss is computed using the binary cross-entropy (BCE) between the predicted class probabilities $\hat{p}_{i,k}$ and the one-hot ground truth vector $y_i$:

$$\mathcal{L}^{(s)}_{cls} = -\frac{1}{|\mathcal{P}_s|} \sum_{i \in \mathcal{P}_s} \sum_{k=1}^{K} \Big[\, y_{i,k} \log \hat{p}_{i,k} + (1 - y_{i,k}) \log (1 - \hat{p}_{i,k}) \Big] \qquad (17)$$

where $\mathcal{P}_s$ denotes the set of positive samples at scale $s$, and $|\mathcal{P}_s|$ is its cardinality.

The regression loss comprises two components: the Complete IoU (CIoU) loss and the Distribution Focal Loss (DFL). Let $\hat{b}$ be the predicted bounding box derived from the spatial distances $(l, t, r, b)$, and $b^{gt}$ be the ground truth box. The CIoU loss is utilized to enforce geometric consistency:

$$\mathcal{L}_{CIoU} = 1 - \mathrm{IoU}(\hat{b}, b^{gt}) + \frac{\rho^{2}(\hat{b}, b^{gt})}{c^{2}} + \alpha v \qquad (18)$$

where $\rho$ is the Euclidean distance between the center points of the predicted and ground truth boxes, $c$ is the diagonal length of the smallest enclosing box, $v$ measures the aspect ratio consistency, and $\alpha$ is a positive trade-off weight.

The DFL is designed to model the general distribution of the bounding box coordinates. Let $t_{i,d}$ denote the continuous ground truth distance in direction $d \in \{l, t, r, b\}$ for the $i$-th positive sample, and let $k^{-}_{i,d}$ and $k^{+}_{i,d} = k^{-}_{i,d} + 1$ be its two nearest integer bins. The loss is defined as:

$$\mathcal{L}_{DFL} = -\frac{1}{|\mathcal{P}_s|} \sum_{i \in \mathcal{P}_s} \sum_{d \in \{l,t,r,b\}} \Big[\, (k^{+}_{i,d} - t_{i,d}) \log P_{i,d}(k^{-}_{i,d}) + (t_{i,d} - k^{-}_{i,d}) \log P_{i,d}(k^{+}_{i,d}) \Big] \qquad (19)$$

where $P_{i,d}(k)$ represents the predicted probability that the distance in direction $d$ falls on the integer index $k$. The terms $(k^{+} - t)$ and $(t - k^{-})$ (omitting indices for brevity) act as interpolation weights, enabling the network to learn the precise spatial distribution from discrete outputs.
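A minimal sketch of the DFL in Eq. (19) is shown below for a single direction; the tensor shapes and the reduction to a mean are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distribution_focal_loss(pred_logits, target, R=16):
    """Sketch of the Distribution Focal Loss (Eq. 19) for one direction:
    the continuous target distance is split between its two neighbouring
    integer bins, weighted by proximity. pred_logits: (N, R); target: (N,)
    continuous distances in [0, R-1]."""
    k_lo = target.floor().long().clamp(max=R - 2)    # nearest lower bin
    k_hi = k_lo + 1                                  # nearest upper bin
    w_lo = k_hi.float() - target                     # interpolation weights
    w_hi = target - k_lo.float()
    logp = F.log_softmax(pred_logits, dim=-1)
    loss = -(w_lo * logp.gather(1, k_lo[:, None]).squeeze(1)
             + w_hi * logp.gather(1, k_hi[:, None]).squeeze(1))
    return loss.mean()

# usage: distribution_focal_loss(torch.randn(8, 16), torch.rand(8) * 15)
```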

The final multi-scale loss is computed as the weighted sum over all scales $s$:

$$\mathcal{L} = \sum_{s=1}^{S} \Big( \lambda_{cls}\, \mathcal{L}^{(s)}_{cls} + \lambda_{CIoU}\, \mathcal{L}^{(s)}_{CIoU} + \lambda_{DFL}\, \mathcal{L}^{(s)}_{DFL} \Big) \qquad (20)$$

where $\lambda_{cls}$, $\lambda_{CIoU}$, and $\lambda_{DFL}$ are balancing coefficients for the three loss components.

Experimental results and analysis

Experimental setup

All dataset samples were split into training, validation, and test sets with proportions of 70%, 10%, and 20% respectively. We evaluated model performance using four metrics: mAP50, mAP50:95, Recall, and F1-score. The metrics are defined as follows:

  • $\mathrm{mAP}_{50}$: Mean Average Precision at IoU threshold 0.5, measuring detection accuracy when the predicted bounding box overlaps the ground truth by at least 50% (a minimal computational sketch of AP follows this list):
    $$\mathrm{mAP}_{50} = \frac{1}{N} \sum_{i=1}^{N} AP^{50}_{i} \qquad (21)$$
    where $N$ is the total number of classes and $AP^{50}_{i}$ is the Average Precision for class $i$ at IoU threshold 0.5, calculated as $AP^{50}_{i} = \int_{0}^{1} p_i(r)\, dr$, with $p_i(r)$ denoting the precision-recall curve at IoU = 0.5.
  • $\mathrm{mAP}_{50:95}$: This metric provides a comprehensive assessment of localization accuracy by averaging the mean Average Precision (mAP) over multiple Intersection over Union (IoU) thresholds, ranging from 0.5 to 0.95 in increments of 0.05:
    $$\mathrm{mAP}_{50:95} = \frac{1}{|T|} \sum_{t \in T} \mathrm{mAP}_{t} \qquad (22)$$
    where $T = \{0.50, 0.55, \ldots, 0.95\}$ is the set of IoU thresholds (10 thresholds in total, with a step size of 0.05), and $\mathrm{mAP}_{t}$ denotes the mean Average Precision at IoU threshold $t$, computed as:
    $$\mathrm{mAP}_{t} = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} p_{i,t}(r)\, dr \qquad (23)$$
    where $p_{i,t}(r)$ is the precision-recall function for class $i$ at IoU threshold $t$, $N$ is the total number of classes, and the integral $\int_{0}^{1} p_{i,t}(r)\, dr$ represents the area under the precision-recall curve for class $i$ at threshold $t$.
  • Recall: The fraction of true positive detections among all actual positives:
    $$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (24)$$
    where $TP$ (True Positives) is the number of correctly detected positive instances and $FN$ (False Negatives) is the number of undetected positive instances.
  • F1-score: The harmonic mean of precision and recall, calculated as:
    $$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (25)$$
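As referenced in the mAP50 item above, the following sketch shows how per-class AP can be computed from ranked detections; the rectangle-rule integration is a simplification of the interpolated AP typically used by detection toolkits, and the function name is illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Sketch of AP (Eqs. 21-23) for one class at a fixed IoU threshold:
    detections are sorted by confidence, cumulative precision/recall are
    computed, and AP approximates the area under the precision-recall curve."""
    order = np.argsort(-np.asarray(scores))
    hits = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(hits)
    fp = np.cumsum(1.0 - hits)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # integrate p(r) dr with a simple rectangle rule over recall increments
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

# usage: average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_gt=2)  # ~0.83
```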

Regarding hyperparameter settings, we adopted a cosine learning rate decay schedule, with a maximum learning rate of 0.001 and a minimum of 0.00001. We trained for up to 200 epochs, terminating early if the validation loss did not decrease over 10 consecutive epochs. The batch size was set to 16, the image size was uniformly standardized to 640 × 640, and the Adam optimizer was used.
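The training configuration above can be reproduced with an Ultralytics-style entry point, as sketched below; the weight file and dataset YAML names are illustrative placeholders, and the argument names follow the Ultralytics training API.

```python
# Hedged sketch of the training setup described above, assuming an
# Ultralytics-style entry point; "cdw.yaml" is a hypothetical dataset config.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")          # Nano baseline; replace with the custom model
model.train(
    data="cdw.yaml",                # dataset definition (hypothetical path)
    epochs=200,                     # maximum number of epochs
    patience=10,                    # stop if validation loss stalls for 10 epochs
    batch=16,
    imgsz=640,
    optimizer="Adam",
    lr0=1e-3,                       # peak learning rate
    lrf=1e-2,                       # final LR = lr0 * lrf = 1e-5
    cos_lr=True,                    # cosine learning-rate schedule
)
```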

Dataset: We employed two publicly available CDW detection datasets in this study to evaluate our proposed algorithm. For convenience in subsequent descriptions, we use the initials of the categories included in each dataset as abbreviations. These two datasets are referred to as BTC (https://data.mendeley.com/datasets/24d45pf8wm/1) and SWP, respectively. Next, we will provide a detailed explanation of these two datasets. The dataset generated and analyzed during the current study is available in the Github repository (https://github.com/jzp-csust/CDW).

  1. BTC: The BTC dataset was collected and made publicly available by Demetriou et al.5. It comprises samples extracted from manually sorted Construction and Demolition Waste (CDW) piles, covering three object categories: bricks, tiles, and concrete. Using an RGB camera with a resolution of 1920 × 1200, the researchers gathered 550 JPG images (with approximately 4 samples per object category per image), resulting in around 6,600 samples across all object categories.

  2. SWP: The SWP dataset was collected and made publicly available by Fang et al.7. The researchers employed a high-definition industrial camera with a 2-megapixel resolution and a frame rate of 60 frames per second to capture the data. Approximately 20,000 images of construction waste were collected, encompassing three categories: steel, wood, and plastic. The captured images had a resolution of 1280 × 1024 and were saved in JPG format. After screening, a final selection of 4,460 images was extracted to constitute the dataset.

Implementation: All methods are implemented using the PyTorch framework in Python. All experiments in this study are conducted on a server equipped with four RTX 4090 24 GB GPUs running Ubuntu 22.04.

Joint distribution of bounding box attributes

Figure 2 displays the distributions of target bounding box attributes in the BTC and SWP construction waste detection datasets, providing useful insight into their characteristics. In terms of overall distribution, the histogram of the bounding-box center x-coordinate in the BTC dataset suggests that targets are spread relatively evenly in the horizontal direction, or follow a particular modal pattern. In the SWP dataset, by contrast, the x-coordinates tend to concentrate in certain horizontal regions, implying that the horizontal positions of construction waste in the images are more regular. For the y-coordinate, both datasets exhibit some degree of concentration, but in different regions, reflecting differences in how the waste is distributed vertically within the images. The width and height histograms likewise indicate noticeable differences in size ranges and concentration trends between the two datasets; for example, the BTC dataset appears to contain more small construction waste targets, while larger targets are more common in the SWP dataset.

Fig. 2. Comparative chart of the joint distribution of bounding box attributes in the BTC and SWP construction waste detection datasets.

In terms of relationships between variables, the x-y scatter plots show that the spatial distribution of targets in the image plane differs between the BTC and SWP datasets, indicating different position-distribution characteristics for construction waste. The width-height scatter plots likewise show different correlation patterns in the two datasets, reflecting differences in the shape characteristics of the waste. Scatter plots relating coordinates to sizes, such as x versus width and y versus height, also behave differently across the two datasets, further highlighting the differences between them.

Comparative experiments

To verify the effectiveness of our proposed construction waste detection method, we compared it with YOLOv11 and several state-of-the-art (SOTA) methods based on YOLOv11. The results are shown in Table 2; Yolo5, Yolo8, and Yolo11 all use the Nano version. The table shows that, in the construction waste detection task, the performance of different methods varies to some extent across the two datasets, BTC and SWP. Overall, the proposed method achieves the best performance on most metrics, particularly excelling on the SWP dataset. Specifically, in the four key metrics, mAP50, mAP50:95, F1, and Recall, our method significantly outperforms all other compared methods. On the SWP dataset, it achieves an mAP50 of 0.962, an F1 score of 0.958, and an impressive Recall of 0.992, demonstrating strong advantages in both detection accuracy and recall. In contrast, other methods such as Yolo11-Goldyolo-Asf and Yolo11-WFU, while performing well, still fall short of our approach.

Table 2.

The performance comparison of SOTA methods for construction waste detection.

Methods BTC SWP GFLOPs
mAP50 mAP50:95 F1 Recall mAP50 mAP50:95 F1 Recall
Yolo5 0.891 0.406 0.839 0.956 0.901 0.747 0.849 0.964 4.5
Yolo8 0.897 0.410 0.842 0.961 0.909 0.752 0.858 0.971 8.7
Yolo1140 0.906 0.428 0.862 0.973 0.916 0.766 0.872 0.981 6.3
Yolo11-C3k2-iRMB41 0.885 0.385 0.834 0.971 0.886 0.692 0.857 0.983 6
Yolo11-SRFD42 0.915 0.459 0.863 0.981 0.933 0.789 0.905 0.985 7.6
Yolo11-Swintransformer43 0.916 0.393 0.874 0.981 0.940 0.766 0.903 0.987 77.6
Yolo11-WFU44 0.925 0.442 0.886 0.985 0.939 0.787 0.915 0.988 8
Yolo11-ADown45 0.928 0.439 0.887 0.982 0.906 0.731 0.857 0.984 7.9
Yolo11-Goldyolo-Asf46 0.930 0.439 0.893 0.984 0.944 0.794 0.928 0.985 9.5
Ours 0.938 0.495 0.889 0.992 0.962 0.839 0.958 0.992 5.3

Best results are in bold and second-best in italics.

On the BTC dataset, the method we proposed also demonstrates outstanding performance, with an mAP50 of 0.938 and a Recall of 0.992, both of which are the highest among all methods. Notably, Yolo11-Goldyolo-Asf and Yolo11-ADown achieve mAP50 scores of 0.930 and 0.928 on the BTC dataset, respectively, which are relatively close to our method. However, they still lag behind in mAP50:95 and F1 scores, indicating that our proposed method not only excels in basic detection accuracy but also exhibits superior stability in high-threshold detection. Additionally, while Yolo11-Swintransformer and Yolo11-SRFD perform reasonably well on certain metrics, their overall performance remains inferior to our approach.

Table 3 compares the performance on the BTC and SWP datasets, providing a detailed evaluation for each specific category based on both mAP50 and mAP50:95 metrics. Overall, our method achieves the best or near-best performance on the majority of metrics, particularly excelling in mAP50:95 on the SWP dataset and for the tile category in mAP50:95 on the BTC dataset, demonstrating its excellent overall detection accuracy and generalization capability. In contrast, various baseline and improved models show fluctuations in performance across different datasets and categories. For example, Yolo11-SRFD performs best on mAP50 for the brick category in BTC (0.921), while Yolo11-C3k2-iRMB performs relatively weaker across multiple metrics.

Table 3.

Comparison of AP results for each class on the BTC and SWP datasets.

Methods BTC: mAP50 BTC: mAP50:95 SWP: mAP50 SWP: mAP50:95
concrete tile brick concrete tile brick plastics wood steels plastics wood steels
Yolo5 0.886 0.896 0.891 0.395 0.410 0.411 0.896 0.912 0.895 0.731 0.765 0.745
Yolo8 0.901 0.903 0.887 0.405 0.421 0.404 0.902 0.932 0.893 0.746 0.741 0.769
Yolo1140 0.945 0.897 0.876 0.434 0.460 0.391 0.881 0.968 0.899 0.742 0.853 0.702
Yolo11-C3k2-iRMB41 0.935 0.836 0.885 0.410 0.395 0.352 0.852 0.982 0.820 0.675 0.809 0.591
Yolo11-SRFD42 0.940 0.886 0.921 0.444 0.499 0.435 0.898 0.970 0.930 0.755 0.849 0.762
Yolo11-Swintransformer43 0.965 0.892 0.891 0.382 0.436 0.361 0.933 0.974 0.912 0.769 0.852 0.678
Yolo11-WFU44 0.953 0.919 0.903 0.453 0.496 0.376 0.925 0.979 0.913 0.773 0.872 0.717
Yolo11-ADown45 0.950 0.927 0.906 0.452 0.487 0.377 0.903 0.941 0.875 0.731 0.798 0.662
Yolo11-Goldyolo-Asf46 0.941 0.906 0.944 0.450 0.451 0.417 0.925 0.972 0.936 0.782 0.843 0.756
Ours 0.960 0.913 0.941 0.467 0.521 0.436 0.957 0.980 0.950 0.835 0.872 0.809

As can be seen from the computational complexity comparison in Table 2, different YOLO variants exhibit significant efficiency differences. The original YOLO5 has the lowest computational load (4.5 GFLOPs), demonstrating a clear lightweight advantage. In contrast, YOLO11-Swintransformer, due to the introduction of the Transformer structure, experiences a sharp increase in computational load to 77.6 GFLOPs, making it suitable only for scenarios with sufficient computational resources. Most variants in the YOLO11 series, such as YOLO11-SRFD, WFU, and ADown, have computational loads concentrated between 6.0 and 9.5 GFLOPs, representing a moderate computational burden. Among these, YOLO11-C3k2-iRMB (6.0 GFLOPs) stands out for its relatively higher efficiency within the series. Notably, our method achieves a computational load of 5.3 GFLOPs, which is slightly higher than that of YOLO5 but significantly lower than most YOLO11 variants, demonstrating a favorable balance in computational efficiency.

The table also reveals that the performance trends of different methods are not entirely consistent across the two datasets. For example, Yolo11-ADown performs well on the BTC dataset (mAP50 of 0.928) but achieves only 0.906 mAP50 on the SWP dataset, significantly lower than other methods. This suggests that the method may have limitations in its adaptability to different datasets. In contrast, the method we proposed maintains consistently high performance across both datasets, demonstrating stronger generalization capabilities and robustness. In summary, by optimizing the model architecture or training strategies, our proposed method achieves significant improvements across multiple key metrics, providing a more reliable solution for construction waste detection tasks.

Ablation experiments

To validate the effectiveness and individual contributions of the proposed Multi-scale Interaction (MSI) module and Cascaded Group Attention (CGA) module within our improved YOLO11 framework for construction waste detection, we conducted a series of ablation studies. The comparative results, presented in Table 4, demonstrate the progressive performance gains achieved by integrating these components.

From the overall trend, for both the BTC and SWP datasets, the introduction of MSI module and the CGA module progressively led to steady and significant improvements across all performance metrics. Building upon the already excellent baseline performance of Yolo11, the final model integrating both MSI and CGA modules achieved the best results in all evaluation metrics for both tasks, validating the effectiveness and complementarity of each module design. Particularly noteworthy is the more than 6 percentage-point improvement in the more challenging average precision metric, mAP50:95, compared to the baseline, indicating a significant breakthrough in balancing localization accuracy and recall performance.

Regarding the contribution of each module, the MSI module primarily enhanced the model’s adaptability to objects of varying scales, contributing to an approximately 1–2 percentage-point increase in mAP across both tasks. In contrast, the CGA module further improved the model’s discriminative ability by strengthening the association between feature space and positional information, yielding a slightly larger gain than the MSI module when used individually. Ultimately, the synergistic effect of the MSI and CGA modules produced a “1+1>2” integration effect. While maintaining an extremely high recall rate (up to 0.992), the model substantially improved precision-related metrics. For instance, on the SWP dataset, the final model achieved an F1 score of 0.958, an improvement of nearly 9 percentage points compared to the baseline. This demonstrates particularly significant progress in reducing false positives and missed detections, effectively balancing precision and recall, and substantially enhancing overall detection reliability (Table 4).

Table 4.

Performance comparison of ablation experiments on construction waste detection datasets.

Method BTC SWP
Yolo11 MSI CGA mAP50 mAP50:95 F1 Recall mAP50 mAP50:95 F1 Recall
✓ – – 0.906 0.428 0.862 0.973 0.916 0.766 0.872 0.981
✓ ✓ – 0.918 0.445 0.871 0.979 0.933 0.792 0.902 0.984
✓ – ✓ 0.925 0.462 0.878 0.985 0.942 0.805 0.913 0.987
✓ ✓ ✓ 0.938 0.495 0.889 0.992 0.962 0.839 0.958 0.992

Significant values are in bold.

Analysis of the training process

To enhance the transparency and reproducibility of our research, we provide a detailed description of the model’s evolution during the training process, including changes in loss functions and performance metrics. The results are shown in Figs. 3 and 4. Figure 3 demonstrates the training and validation results of our method on the BTC dataset. From the training loss curves, it can be observed that box_loss, cls_loss, and dfl_loss all exhibit a decreasing trend as the number of training epochs increases, and the decrease is relatively smooth. This indicates that the model can learn stably during the training process, and various losses are being gradually optimized. In terms of validation loss, a similar decreasing trend of box_loss, cls_loss, and dfl_loss can be seen as the training progresses, suggesting that the model’s performance on the validation set is also improving, with no significant overfitting phenomenon. Regarding the evaluation metrics, precision, recall, mAP50, and mAP50:95 all gradually increase with the training process. In particular, the improvements in precision and recall are more notable, indicating that the model’s accuracy and recall capabilities in the object detection task are both enhancing, and its overall performance is continuously improving.

Fig. 3. The changes in loss and metrics during the training and validation of our proposed method on the BTC dataset.

Fig. 4. The changes in loss and metrics during the training and validation of our proposed method on the SWP dataset.

Figure 4 presents the training and validation situation of our method on the SWP dataset. In the training loss curves, box_loss, cls_loss, and dfl_loss also show a decreasing trend. Although the magnitude and speed of the decrease differ from those of the first dataset, the overall trend is consistent, indicating that the model can also learn effectively on this dataset. The validation loss curves also display a similar decreasing pattern, suggesting that the model’s generalization ability on the validation set is gradually improving. In terms of evaluation metrics, precision, recall, mAP50, and mAP50:95 also increase with the number of training epochs. However, compared to the first dataset, there are differences in the magnitude of the increase and the final values achieved for some metrics. This may be due to the different characteristics and distributions of the two datasets.

Considering the results of both datasets comprehensively, our method demonstrates good training and validation effects on different data. The continuous decrease in training loss and validation loss indicates that the model has stable learning ability and generalization ability. The continuous improvement of evaluation metrics shows that the model’s performance in the object detection task is being gradually optimized. Although there are differences in the specific metric values and the magnitude of changes on the two datasets, the consistency of the overall trends verifies the effectiveness and robustness of our method in different scenarios, enabling it to adapt to different data distributions and achieve good detection results.

Analysis of key metrics

Confusion matrix

To conduct a detailed analysis of how often the model misjudges each category during detection, we visualized the confusion matrices for BTC and SWP. The results are shown in Fig. 5. These two normalized confusion matrices present the prediction results for the various categories in the construction waste detection datasets, where the background represents the blank areas in the images.

Fig. 5. The normalized confusion matrix of the test set prediction results.

The first confusion matrix involves concrete, tile, brick, and the background. The correct prediction rate for the concrete category is 0.84, indicating that most concrete objects can be accurately identified. However, there are still some misjudgments: 0.15 of concrete is misjudged as tile, 0.01 as brick, and 0.34 of the background areas are misdetected as concrete, implying that the model has certain errors in distinguishing concrete from the background. The tile category performs outstandingly, with a correct prediction rate of 0.92 and only a small amount of misjudgment; nevertheless, 0.36 of the background is misdetected as tile. The brick category has a correct prediction rate of 0.88, and 0.30 of the background is misdetected as brick. The background itself has a low correct prediction proportion, with a large amount misdetected as other construction waste categories, reflecting that the model's ability to recognize the background is relatively weak.

The second confusion matrix includes plastic, wood, steel, and the background. The correct prediction rate for plastic is 0.94, showing a good performance, but 0.44 of the background is misdetected as plastic. The detection effect for wood is excellent, with a correct prediction rate of 0.99 and only a minimal amount of misjudgment. The proportion of the background misdetected as wood is 0.12. The correct prediction rate for steel is 0.96, and the proportion of the background misdetected as steel is also 0.44. Similarly, the background has a low correct prediction proportion, and most of it is misjudged as other categories.

Overall, the model has a certain detection effect on construction waste categories such as concrete, tile, brick, plastic, wood, and steel, with some categories having relatively high detection accuracy. However, the model has obvious problems in distinguishing construction waste objects from the background. The background is largely misdetected as various types of construction waste, which may affect the accuracy and reliability of object detection. Subsequent optimization of the model is needed to improve its ability to recognize the background and reduce misdetection.

Precision-recall curve

To gain a detailed understanding of the model’s overall performance and reveal the differences in its performance under various types of predictions, we visualized the Precision-Recall (PR) curves of the prediction results for the BTC and SWP datasets. This was done to analyze the model’s missed detections and false detections. The results are shown in Fig. 6.

Fig. 6. The precision-recall (PR) curves of the test set prediction results on the BTC and SWP datasets.

In the first graph, the PR curves illustrate the trade-off between precision and recall for the model when detecting concrete, tile, brick, and all classes. It can be observed that the PR curves for these three classes are relatively close and maintain high precision levels until the recall approaches 1, where precision drops sharply. This indicates that the model can cover most positive samples while maintaining high precision when detecting these classes. The values annotated in the legend represent the Average Precision (AP) for each class and all classes. Concrete has the highest AP of 0.960, followed by brick at 0.941 and tile at 0.913. The mean Average Precision at an IoU threshold of 0.5 (mAP50) for all classes is 0.938, demonstrating the model’s overall good detection performance.

The second graph also presents PR curves for plastics, wood, steels, and all classes. Similar to the first graph, the PR curves for these three classes maintain high precision over most of the recall range, with precision dropping rapidly as recall approaches 1. Wood has the highest AP of 0.980, followed by plastics at 0.957 and steels at 0.950. The mAP50 for all classes is 0.962. Compared to the mAP in the first graph, the overall mAP in the second graph is higher, indicating that the model’s comprehensive performance in detecting these classes on this dataset is more robust overall.

Overall, both PR curve graphs demonstrate that the model performs well on their respective datasets, achieving high recall at high precision levels. However, there are still performance differences among the classes.

Visual analysis of predicted results

To demonstrate the accuracy of our algorithm in the construction waste detection task and provide qualitative evidence, we selected samples from the test sets of the two datasets, BTC and SWP, and visualized the results, as shown in Figs. 7 and 8. Figure 7 provides a detailed display of the prediction results of the object detection algorithm on the BTC dataset, along with their corresponding ground truth labels, covering the detection of various materials such as concrete blocks, tile fragments, and bricks. As can be seen in the images, the algorithm can accurately identify and classify these materials in most cases, demonstrating solid detection capability. However, when faced with visually similar materials or complex scenarios (such as overlapping objects, adjacent objects, and varying lighting, angles, and occlusions), the algorithm may misclassify, miss detections, or mistakenly merge multiple objects into one. This indicates that there is still room for improvement in the algorithm's detection accuracy and robustness.

Fig. 7. Visualization of true labels and predicted results for example test samples in the BTC dataset.

Fig. 8. Visualization of true labels and predicted results for example test samples in the SWP dataset.

Figure 8 presents the prediction results of the object detection algorithm on the SWP dataset (comprising materials such as plastics, wood, and steel) along with their corresponding annotations. From these images, it is evident that the algorithm's predictions are highly accurate, with almost no classification errors in the displayed samples. Each material, be it plastics, wood, or steel, is correctly identified and labeled, and the confidence scores are generally high, mostly close to or exceeding 0.9, indicating the algorithm's strong capability in learning and recognizing the features of these materials. Moreover, even in cases where the materials have irregular shapes, complex surface textures, or partial occlusions, the algorithm maintains a high level of detection accuracy, demonstrating excellent robustness and generalization ability. Overall, these prediction results attest to the outstanding performance of the object detection algorithm on this dataset.

Visual results also reveal the model's limitations in specific complex scenarios. As shown in Fig. 7 (486.jpg), misclassification occurs between concrete fragments and white ceramic tiles due to their similar local color and texture, reflecting the model's insufficient use of global context (e.g., fragment shape) for fine-grained material discrimination. In Fig. 8 (4412.jpg), a single curved steel bar is incorrectly detected as two separate objects. We attribute this primarily to the large differences in appearance and orientation between distinct parts of the object when its shape is severely bent, which prevents the model from associating the parts and recognizing them as a complete entity. This indicates that the current model still lacks sufficient capability for holistic modeling of non-rigid objects with drastic shape variations.

Conclusions

This paper proposes a novel attention mechanism-based YOLOv11 object detection algorithm for automatically classifying CDW. The algorithm significantly improves detection performance under complex backgrounds, severe occlusion, and multiscale targets by introducing a CGA mechanism, multi-scale feature interaction modules (C2PSA and C3k2), and a bidirectional feature fusion strategy. Experimental results on two public CDW datasets, BTC and SWP, demonstrate that the algorithm outperforms existing state-of-the-art methods in key metrics such as mAP50, F1-score, and recall, achieving mAP50 values of 0.938 and 0.962, respectively, which reflects excellent detection accuracy and robustness. Furthermore, the training process remains stable with consistently decreasing validation loss and no significant overfitting, indicating strong generalization capability. Although some misdetection between background and target objects remains, overall, the algorithm provides reliable technical support for the automated sorting of construction waste and possesses considerable practical application value.

Author contributions

Zeping Jiang conceived the experiment(s), Zeping Jiang and Ying Yang conducted the experiment(s), Jiayi Hu and Chuyan Yuan analysed the results. All authors reviewed the manuscript.

Funding

This work was supported by the National Key R&D Program of China through the project "Key Technology Research and Application Demonstration for the Construction of Low-Carbon Ecological Rural Communities" (No. 2024YFD160040). Ying Yang is the grant recipient.

Data availability

The datasets generated and/or analysed during the current study are available in the Github repository: https://github.com/jzp-csust/CDW.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zeping Jiang and Ying Yang contributed equally to this work.

References

  • 1.Li, J. et al. Rgb-d fusion models for construction and demolition waste detection. Waste Manage.139, 96–104 (2022). [DOI] [PubMed] [Google Scholar]
  • 2.Yang, Y., Li, Y. & Tao, M. Fe-yolo: a lightweight model for construction waste detection based on improved yolov8 model. Buildings14, 2672 (2024). [Google Scholar]
  • 3.Guo, H. et al. Durability of recycled aggregate concrete-a review. Cement Concr. Compos.89, 251–259 (2018). [Google Scholar]
  • 4.Dimitriou, G., Savva, P. & Petrou, M. F. Enhancing mechanical and durability properties of recycled aggregate concrete. Constr. Build. Mater.158, 228–235 (2018). [Google Scholar]
  • 5.Demetriou, D. et al. Real-time construction demolition waste detection using state-of-the-art deep learning methods; single-stage vs two-stage detectors. Waste Manage.167, 194–203 (2023). [DOI] [PubMed] [Google Scholar]
  • 6.Davis, P., Aziz, F., Newaz, M. T., Sher, W. & Simon, L. The classification of construction waste material using a deep convolutional neural network. Autom. Constr.122, 103481 (2021). [Google Scholar]
  • 7.Fang, H., Chen, J., Wang, M., Wu, Q. & Wang, Z. Real-time detection of construction and demolition waste impurities using the improved yolo-v7 network. J. Mater. Cycles Waste Manage.26, 2200–2213 (2024). [Google Scholar]
  • 8.Langley, A., Lonergan, M., Huang, T. & Azghadi, M. R. Analyzing mixed construction and demolition waste in material recovery facilities: Evolution, challenges, and applications of computer vision and deep learning. Resour. Conserv. Recycl.217, 108218 (2025). [Google Scholar]
  • 9.Lin, K. et al. Deep convolutional neural networks for construction and demolition waste classification: Vggnet structures, cyclical learning rate, and knowledge transfer. J. Environ. Manage.318, 115501 (2022). [DOI] [PubMed] [Google Scholar]
  • 10.Xiao, W., Yang, J., Fang, H., Zhuang, J. & Ku, Y. A robust classification algorithm for separation of construction waste using nir hyperspectral system. Waste Manage.90, 1–9 (2019). [DOI] [PubMed] [Google Scholar]
  • 11.Song, H. et al. Symmetrical learning and transferring: efficient knowledge distillation for remote sensing image classification. Symmetry17, 1002 (2025). [Google Scholar]
  • 12.Song, H. et al. Cmkd-net: a cross-modal knowledge distillation method for remote sensing image classification. Adv. Space Res.75, 8515–8534. 10.1016/j.asr.2025.04.009 (2025). [Google Scholar]
  • 13.Song, H., Yuan, Y., Ouyang, Z., Yang, Y. & Xiang, H. Quantitative regularization in robust vision transformer for remote sensing image classification. Photogram. Rec.39, 340–372 (2024). [Google Scholar]
  • 14.Shen, J. et al. Lightweight semantic feature extraction model with direction awareness for aerial traffic object detection. IEEE Trans. Intell. Transport. Syst.2025, 1–18. 10.1109/TITS.2025.3642410 (2025). [Google Scholar]
  • 15.Shen, J. et al. An anchor-free lightweight deep convolutional network for vehicle detection in aerial images. IEEE Trans. Intell. Transp. Syst.23, 24330–24342. 10.1109/TITS.2022.3203715 (2022). [Google Scholar]
  • 16.Shen, J., Liu, N., Sun, H., Li, D. & Zhang, Y. An instrument indication acquisition algorithm based on lightweight deep convolutional neural network and hybrid attention fine-grained features. IEEE Trans. Instrum. Meas.73, 1–16. 10.1109/TIM.2023.3346488 (2024). [Google Scholar]
  • 17.Shen, J. et al. Finger vein recognition algorithm based on lightweight deep convolutional neural network. IEEE Trans. Instrum. Meas.71, 1–13. 10.1109/TIM.2021.3132332 (2022). [Google Scholar]
  • 18.Liu, L. et al. Deep learning for generic object detection: a survey. Int. J. Comput. Vis.128, 261–318 (2020). [Google Scholar]
  • 19.Xiao, Y. et al. A review of object detection based on deep learning. Multimedia Tools Appl.79, 23729–23791 (2020). [Google Scholar]
  • 20.Ghasemi, Y., Jeong, H., Choi, S. H., Park, K.-B. & Lee, J. Y. Deep learning-based object detection in augmented reality: a systematic review. Comput. Ind.139, 103661 (2022). [Google Scholar]
  • 21.Ranjbar, I., Ventikos, Y. & Arashpour, M. Deep learning-based construction and demolition plastic waste classification by resin type using rgb images. Resour. Conserv. Recycl.212, 107937 (2025). [Google Scholar]
  • 22.Lu, W., Chen, J. & Xue, F. Using computer vision to recognize composition of construction waste mixtures: a semantic segmentation approach. Resour. Conserv. Recycl.178, 106022 (2022). [Google Scholar]
  • 23.Ku, Y., Yang, J., Fang, H., Xiao, W. & Zhuang, J. Deep learning of grasping detection for a robot used in sorting construction and demolition waste. J. Mater. Cycles Waste Manage.23, 84–95 (2021). [Google Scholar]
  • 24.Bochkovskiy, A., Wang, C.-Y. & Liao, H.-Y. M. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
  • 25.Li, C. et al. Yolov6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
  • 26.Redmon, J. & Farhadi, A. Yolo9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7263–7271 (2017).
  • 27.Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Yolov7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7464–7475 (2023).
  • 28.Liu, W. et al. Ssd: single shot multibox detector. In European Conference on Computer Vision 21–37 (Springer, 2016).
  • 29.Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 779–788 (2016).
  • 30.Lu, J., Song, W., Zhang, Y., Yin, X. & Zhao, S. Real-time defect detection in underground sewage pipelines using an improved yolov5 model. Autom. Constr.173, 106068 (2025). [Google Scholar]
  • 31.Song, Q., Yuan, C., Cao, L., Huang, C. & Yang, Q. Yolo-af: a garbage dump detection model combining attention mechanism and focused convolution. Comput. Sci. Appl.14, 468 (2024). [Google Scholar]
  • 32.Sohan, M., Sai Ram, T. & Rami Reddy, C. V. A review on yolov8 and its advancements. In International Conference on Data Intelligence and Cognitive Informatics 529–545 (Springer, 2024).
  • 33.Bai, C., Bai, X. & Wu, K. A review: Remote sensing image object detection algorithm based on deep learning. Electronics12, 4902 (2023). [Google Scholar]
  • 34.Wei, J. et al. A review of yolo algorithm and its applications in autonomous driving object detection. IEEE Access (2025).
  • 35.Tang, W., Deng, Y. & Luo, X. Rst-yolov8: an improved chip surface defect detection model based on yolov8. Sensors25, 3859 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Hidayatullah, P., Syakrani, N., Sholahuddin, M. R., Gelar, T. & Tubagus, R. Yolov8 to yolo11: a comprehensive architecture in-depth comparative review. arXiv preprint arXiv:2501.13400 (2025).
  • 37.Khanam, R. & Hussain, M. Yolov11: An overview of the key architectural enhancements (2024).
  • 38.Wang, Y., Bu, F., Sun, A. & Zhang, Y. Cdwnet: an efficient and lightweight classification and object detection algorithm for construction and demolition waste. In 2024 4th International Conference on Electronic Information Engineering and Computer Communication (EIECC) 1370–1374 (IEEE, 2024).
  • 39.Zhang, J. et al. Behavior detection of dairy goat based on yolo11 and elslowfast-lstm. Comput. Electron. Agric.234, 110224 (2025). [Google Scholar]
  • 40.Khanam, R. & Hussain, M. Yolov11: an overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725 (2024).
  • 41.Zhang, J. et al. Rethinking mobile block for efficient attention-based models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 1389–1400 (IEEE Computer Society, 2023).
  • 42.Lu, W., Chen, S.-B., Tang, J., Ding, C. H. & Luo, B. A robust feature downsampling module for remote-sensing visual tasks. IEEE Trans. Geosci. Remote Sens.61, 1–12 (2023). [Google Scholar]
  • 43.Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 10012–10022 (2021).
  • 44.Li, W. et al. Efficient face super-resolution via wavelet-based feature enhancement network. In Proceedings of the 32nd ACM International Conference on Multimedia 4515–4523 (2024).
  • 45.Wang, C.-Y., Yeh, I.-H. & Mark Liao, H.-Y. Yolov9: learning what you want to learn using programmable gradient information. In European Conference on Computer Vision 1–21 (Springer, 2024).
  • 46.Wang, C. et al. Gold-yolo: efficient object detector via gather-and-distribute mechanism. Adv. Neural. Inf. Process. Syst.36, 51094–51112 (2023). [Google Scholar]
