Abstract
Sewer pipelines are a critical foundation of modern urban infrastructure, requiring frequent inspection to ensure integrity and prevent deterioration. Manual evaluation of CCTV footage is inefficient and prone to errors, particularly under real-world conditions where low-contrast imagery, visual noise, and invisible sediment deposits hinder defect detection. Existing deep learning models have struggled to generalize across these challenges, motivating the need for a more accurate and resilient framework for classification and localization. We present a two-stage approach that first leverages a hybrid ResNet50–Swin Transformer classifier to robustly distinguish defective from non-defective images with 90.28% accuracy, dramatically reducing misclassification and data volume for the subsequent stage. This filtering mechanism directly enhances detection precision—improving the mAP of our modified YOLOv8 from 0.70 to 0.81 (an 11-point gain) by eliminating false-positive inputs. For defect localization, we integrate Convolutional Block Attention Modules (CBAM) into YOLOv8, enhancing the model’s ability to focus on regions where defect boundaries are difficult to distinguish. This attention mechanism significantly improves localization precision, particularly for visually subtle or ambiguous defects. The fusion of residual learning and hierarchical attention proves highly effective on our inspection dataset comprising 6,912 images from over 200 sewer pipelines in Iran. This system offers a scalable, real-time solution for urban infrastructure monitoring, outperforming baseline methods in both efficiency and accuracy.
Keywords: Sewer pipelines, Closed-circuit television (CCTV), Classification and localization, Hybrid ResNet50–Swin transformer, Modified YOLOv8, Convolutional block attention module (CBAM)
Subject terms: Engineering, Environmental sciences, Mathematics and computing
Introduction
Urban sewerage systems are the backbone of urban infrastructure, providing efficient management of wastewater. In most nations, these pipelines have been operating for over a century, which has led to structural failure1. Such failure can lead to severe issues like sanitary sewer overflows, infiltration, and even sinkholes in extreme situations. For instance, the U.S. Environmental Protection Agency estimates that between 23,000 and 75,000 sanitary sewer overflows occur each year nationwide, posing a threat to water quality and public health2. More than half of these events are due to blockages caused by grease and debris accumulation in pipes; other causes include tree root invasion, pipe fractures, sediment buildup, and abnormal flow patterns resulting from excessive infiltration. Forecasts call for spending more than $270 billion on wastewater infrastructure over the next 25 years, most of which will go toward operations and maintenance3. These factors make more efficient inspection and evaluation practices essential. Sewer pipe defects come in different forms with diverse characteristics and effects on structural integrity. Pipe deposits, for instance, do not all form in the same way: some develop on pipe walls (attached deposits), while others develop at the pipe bottom (settled deposits), with potential interference with flow4. Tree root intrusion ranges in severity from dense masses that occupy large volumes of pipe to dispersed, small roots that are less constrictive. Defects need to be located and evaluated early to prevent further deterioration and guarantee the performance of sewer systems5. An open sewer joint is a gap or misalignment between two adjoining pipe sections, often caused by ground movement, thermal expansion, or faulty installation. It is a common fault in aged pipes that lack modern sealing products and can cause groundwater infiltration, sewage exfiltration, and tree root intrusion.
This increases treatment costs, environmental pollution, and damage to the structural integrity of pipes. Severity depends on the gap size and the visibility of soil or voids: the larger the gap, the greater the severity. Detecting such defects early in the damage progression, using technologies such as CCTV surveys combined with deep learning algorithms, is essential to enable timely repair and prevent further degradation6.
Closed-circuit television (CCTV) is a method of investigating the interior of sewer pipes using a camera fitted on a tractor unit that is driven along the pipeline, providing video images on surface monitors. Defects are identified manually by halting the unit and using the zoom capability to examine possible defects close-up. Inspectors record the type, location, and quantity of each defect according to a code or standard. After the inspection, videos might require re-viewing to validate defect information and determine sewer conditions. Despite differences among international standards, critical data for evaluating sewer condition commonly consists of defect type, location within the video frame, distance along the pipeline, number of defects, and severity score. Nevertheless, manual interpretation is time-consuming, labor-intensive, and prone to inconsistency due to subjective judgments by individual inspectors, risking incorrect sewer condition evaluations. To counter these challenges, computer vision techniques offer a promising solution. These techniques can automatically analyze videos, assisting inspectors in detecting defects, reducing workloads, and improving assessment effectiveness. Where archived inspection videos are available, computer vision also enables retrospective analysis, supporting deterioration modeling and prediction of future conditions.
The application of computer vision to the interpretation of inspection videos or images is gaining popularity. Traditional approaches tend to rely on hand-crafted feature extractors and time-consuming image pre-processing, which makes training cumbersome and inefficient. Deep learning, in contrast, has demonstrated robust performance in several computer vision tasks, such as image classification and object detection. Unlike traditional methods, deep learning techniques enable robust feature extraction from images with minimal preprocessing, significantly enhancing accuracy and efficiency. However, standard deep learning approaches often fail to address challenges posed by customized datasets and varying data quality, such as low-contrast imagery, noise, and diverse pipe materials. To overcome these limitations, we developed a tailored hybrid model combining ResNet50 and Swin Transformer for classification and a modified YOLOv8n with attention mechanisms for localization, optimized to meet the specific demands of our sewer inspection dataset and achieve superior performance. The main contributions of this work include: (1) A hybrid ResNet50-Swin Transformer architecture that achieves 90.28% classification accuracy; (2) A modified YOLOv8 detector with optimized attention mechanisms achieving 0.81 mAP@50; (3) Systematic evaluation and optimization of attention mechanisms for sewer defect detection, demonstrating that CBAM integration provides an optimal performance-efficiency balance for real-world deployment scenarios; (4) A comprehensive two-stage pipeline validated on 6,912 images from Iranian sewer systems.
Related studies
In this part, we will first survey the relevant literature and then introduce our proposed methodology. This methodology is specifically developed to address the shortcomings of manual sewer inspections in Iran, with the goal of achieving more efficient and productive inspection of the sewer system.
Over the past few years, there has been a notable rise in the prominence of deep learning algorithms, particularly Convolutional Neural Networks (CNNs), for image classification and object detection. CNNs are capable of automatically extracting prominent and descriptive features from images. This approach simplifies the otherwise complex processes of feature extraction and training, thereby enhancing detection performance significantly7. Object detection technology, a subset of this field, excels in simultaneously detecting, classifying, and locating various types of defects within images using bounding boxes8,9. A prior drawback of CNNs was the requirement for extensive datasets of training images, coupled with high computational costs. However, this limitation has been surmounted through the creation of well-annotated datasets such as ImageNet and COCO, along with advancements in parallel computations utilizing graphics processing units (GPUs)10.
Image classification
Kumar et al.11 proposed a framework using CNNs to classify defects in sewer CCTV images, implementing a preliminary system for the classification of root intrusions, deposits, and cracks. Hassan et al.12 classified six types of defects using the CNN-based network AlexNet, proposing a defect classification system for CCTV inspection videos to enhance the efficiency of sewer pipeline condition assessment; trained on 47,072 labeled images, their fine-tuned CNN model achieved an accuracy of 96.33%. Xie et al.13 developed a robust and efficient system for classifying sewer defects using a two-level hierarchical deep convolutional neural network, trained on a novel dataset of over 40,000 sewer images. Kumar and Abraham14 applied a two-step framework for automated sewer pipeline defect detection, using a 5-layered CNN for classification and the YOLO model for fracture detection; trained on 1,800 images, the framework achieved a 0.71 average precision (AP) in detecting pipe fractures on a test set of 300 images. Li et al.15 proposed a deep convolutional neural network method for detecting and classifying defects in CCTV sewer inspection images, addressing the challenge of imbalanced datasets; a hierarchical classification approach was introduced to enhance performance, with a high-level stage distinguishing defective from normal images, trained on images from 24.7 km of sewer lines. Wang et al.16 introduced AutoSewerNet, a lightweight and high-performance model for sewer defect classification designed using neural architecture search (NAS). By employing a diverse Supernet, gradient-based search, and weight balancing for imbalanced data, AutoSewerNet achieves an F1-score of 0.6251, outperforming ResNet50 and InceptionV3, while using only 11.6% of VGG-16’s computational resources, making it efficient for real-time sewer inspections.
Object detection
To classify and locate sewer defects, several algorithms extending CNNs have been developed, such as region-based CNN (R-CNN)17, Fast R-CNN18, Faster R-CNN19, SPP-net20, Single Shot Detector (SSD)21, and You Only Look Once (YOLO)22; along the way, YOLO has evolved from YOLOv1 to YOLOv8, with significant improvements in detection algorithms and performance. These algorithms can categorize objects into distinct classes and identify the position of each object with a bounding box. Their effectiveness can be evaluated using several metrics, including mean average precision (mAP), intersection over union (IoU), and training speed. Wang and Cheng23 developed a deep learning approach using Faster R-CNN for sewer pipe defect detection from CCTV inspection videos; trained on 3,000 images, the model achieved an 83% mAP with a low miss rate and fast detection speed. Chen et al.24 proposed a deep learning framework for sewer defect detection, introducing a modified RegNet + model with improved features that achieves 99.5% accuracy on a dataset of 20 defect classes. Yu et al.25 proposed a multi-stage defect detection framework using a Swin Transformer-based composite backbone and a comprehensive defect dataset; by integrating modules for multi-stage detection, data augmentation, and model fusion, it achieved a mAP of 78.6% at IoU 0.5, surpassing ResNet50 Faster R-CNN by 14.1% and YOLOv6-large by 6.7%. Yang et al.26 proposed an improved YOLOX-s model incorporating MobileNetV3, SPPF, and CIoU for sewer defect detection, achieving faster inference, reduced model complexity, and a 0.88% mAP improvement. P. He et al.27 developed the DA-YOLOv7 model using data augmentation techniques to enhance defect detection in urban sewer pipelines, achieving 95.93% mAP and a 0.025 s detection time under complex conditions. J. Dong and M. Liao28 proposed a YOLOv8-based algorithm integrating SE and RFB attention mechanisms with wavelet de-noising to improve detection accuracy, stability, and multi-scale processing efficiency for sewer defect detection.
Zhao et al.29 developed PipeMamba, a state space model for automated sewer defect classification that achieves 61.53% mAP on QV-Pipe and 61.75% mAP on CCTV-Pipe while providing 3-7x faster inference than Transformer-based methods through multi-focus video segmentation and circular scanning strategies. Hu et al.30 developed TMFF for automated sewer defect detection, using optical flow to segment inspection videos into multi-focus views and Evidential Deep Learning to achieve 74.33% mAP while identifying unknown defect types. The framework outperforms existing methods but faces computational complexity and data imbalance challenges that limit real-time deployment. Zhao et al.31 introduced POEN for sewer defect classification, using optical flow-based Support Clip Modules to capture fine-grained defect details and Pareto-optimal weighting to balance classification accuracy with uncertainty estimation, achieving 76.36% mAP while improving unknown defect detection by 12.33% AUROC.
Semantic segmentation
Li et al.32 proposed PipeTransUNet, a model combining CNNs and Transformers, for semantic segmentation and severity quantification of sewer pipe defects. Enhanced with attention modules and improved activation functions, it achieved superior performance across various metrics, including a mean intersection over union of 71.92% and a mean F1-score of 83.05%. The model’s defect severity assessment aligns well with expert reviews, and pixel-level visual interpretations validate its reliability. Pan et al.33 developed PipeUNet, a semantic segmentation network enhanced with feature reuse and attention mechanism blocks between U-Net skip connections, achieving 76.37% mIoU for detecting sewer defects including cracks, infiltration, joint offsets, and intruding laterals while processing CCTV images at 32 fps. Zhou et al.34 developed a DeepLabv3+ method with a ResNet-50 backbone for automated pixel-level sewer defect segmentation, achieving 0.90 pixel accuracy (PA) and 0.53 mIoU while correctly assessing severity in 70% of cases, though with a tendency to overestimate severity in the remaining 30%.
Given the persistent challenges of low inter-class variability, poor image quality, low-contrast imagery, and limited generalization in prior studies, a more robust and scalable solution is essential—one that can accurately classify and localize defects in complex real-world sewer environments. Our two-stage deep learning framework employs a hybrid ResNet50-Swin Transformer architecture for defect classification, where ResNet50’s convolutional layers excel at capturing local texture patterns critical for identifying defect-specific features (e.g., deposit patterns, surface irregularities), while the Swin Transformer’s self-attention mechanism captures long-range spatial dependencies essential for understanding defect context within the pipe structure. This complementary fusion—rather than simple concatenation—leverages ResNet50’s computational efficiency for local feature extraction and Swin Transformer’s global context modeling, achieving superior performance compared to either architecture alone while maintaining practical inference times through parallel processing. Combined with CBAM-enhanced YOLOv8 for localization, our approach delivers both accurate defect identification and precise localization, fulfilling the Water and Wastewater Company of Iran’s standards for efficient sewer maintenance planning.
Methodology
This study addresses limitations in automated sewer inspection, as mandated by the Water and Wastewater Company of Iran’s two-stage policy, which requires defect classification followed by severity assessment and spatial localization for maintenance prioritization. We propose a robust deep learning framework to enhance performance on low-quality CCTV images, integrating a hybrid ResNet50 and Swin Transformer for accurate defect classification and a modified YOLOv8n model with a convolutional block attention module (CBAM) for precise defect localization (mAP of 0.81). This framework aligns with national inspection standards, offering a scalable solution for smart sewer inspection systems in Iran. The study includes detailed model comparisons, localization metrics, and considerations for real-world deployment, as illustrated in the workflow (Fig. 1).
Fig. 1.
Overall proposed framework in this paper.
Hybrid ResNet50-Swin transformer for defect classification
The fusion strategy leverages ResNet50’s proven ability to extract hierarchical spatial features through residual connections, while Swin Transformer’s window-based attention mechanism captures long-range dependencies that CNNs typically miss in sewer inspection imagery. Rather than simple concatenation, our approach uses ResNet50 as a sophisticated feature extractor, with its output feeding directly into Swin Transformer stages for global context modeling. This architectural design addresses the specific challenges of sewer defect detection where local textural patterns (captured by ResNet50’s convolutional layers) must be integrated with global spatial relationships (modeled by Swin Transformer’s hierarchical attention) to distinguish subtle defects from normal pipe variations.
As demonstrated in the ablation study in Discussion section, our hybrid ResNet50-Swin Transformer model achieves superior performance compared to individual ResNet50 and standalone Swin Transformer architectures. While this fusion strategy results in increased computational overhead compared to lighter alternatives such as MobileNetV2, the substantial accuracy improvement aligns with the stringent requirements of the Water and Wastewater Company of Iran’s inspection protocols. The moderate inference latency per prediction remains well within acceptable bounds for practical deployment scenarios, while the enhanced detection accuracy directly supports the company’s mandate for reliable defect identification to prevent infrastructure failures and ensure public safety. This performance-complexity trade-off is justified given the critical nature of sewer infrastructure monitoring, where detection accuracy takes precedence over computational efficiency in meeting national inspection standards.
The input to the model is an RGB image of size 224 × 224 × 3, which initially undergoes a 7 × 7 convolution with a stride of 2, followed by batch normalization and a ReLU activation. A max pooling operation with a 3 × 3 kernel further reduces the spatial dimensions. This is followed by stacked bottleneck blocks composed of three convolutional layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1, respectively. Each of these is interleaved with batch normalization layers and ReLU activations. The 1 × 1 convolutions are used for dimensionality reduction and expansion, while the 3 × 3 convolutions capture spatial features. Downsampling is performed within specific bottleneck blocks using strided convolutions and matching identity mappings to ensure dimensional consistency. This deep residual structure enhances the network’s ability to learn complex features by enabling smooth gradient flow through the layers35,36. These residual blocks, typically implemented as bottleneck structures, play a crucial role in mitigating the vanishing gradient problem associated with deep neural networks. Rather than learning the original mapping directly, residual learning reformulates the target as a residual function F(x), such that the overall output becomes Eq. 1.
y = F(x, {Wi}) + x    (1)
Where Wi denotes the learnable weights of the stacked layers inside the residual branch. This formulation is visually represented in the bottleneck block architecture in Fig. 2, which includes three convolutional layers, each followed by batch normalization. If the input and output dimensions differ, a downsampling path is included to match the shapes before the element-wise addition. The output of the summation is then passed through a ReLU activation function. This identity-based skip connection facilitates efficient gradient flow and stabilizes training in deep convolutional architectures.
Fig. 2.
The bottleneck of hybrid ResNet50.
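The bottleneck block described above can be sketched in PyTorch as follows. This is a minimal illustration following the standard ResNet50 bottleneck design (1 × 1 reduce, 3 × 3 spatial, 1 × 1 expand, with a projected skip path when shapes differ); the channel sizes and class name are illustrative, not the paper’s exact code.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck: y = F(x, {Wi}) + x (Eq. 1)."""
    expansion = 4

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)   # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride=stride,
                               padding=1, bias=False)          # 3x3 spatial
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)  # 1x1 expand
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Downsampling path to match shapes before element-wise addition
        self.down = None
        if stride != 1 or in_ch != out_ch:
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)  # residual addition, then ReLU
```

For example, `Bottleneck(256, 64)` preserves a 256-channel input’s shape, while `stride=2` halves the spatial resolution through both branches.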
The output of the ResNet-based feature extractor is then forwarded to the Swin TransformerV2 module, which captures global contextual relationships using a hierarchical self-attention mechanism. As shown in Fig. 3, this hybrid architecture begins with a series of convolutional layers: a 2D convolution layer followed by batch normalization, ReLU activation, and max pooling. This is succeeded by a stack of seven bottleneck residual blocks that form a deep and expressive convolutional backbone, optimized for extracting localized and low-to-mid-level features from sewer pipe inspection images.
Fig. 3.
Hierarchical structure of the Swin TransformerV2 architecture used in our proposed hybrid model.
Following this, the extracted feature maps are passed into the Swin TransformerV2 module, beginning with a PatchEmbed layer that reduces the spatial resolution and projects the features into the token space suitable for transformer processing. The transformer component is composed of four hierarchical stages, each represented by a Swin TransformerV2Stage. Each stage contains multiple Swin TransformerV2Blocks that perform window-based self-attention locally, with attention windows shifting alternately to enable cross-window information exchange. Between stages, PatchMerging operations are applied to progressively reduce the spatial resolution while increasing the feature dimensionality, enabling a multiscale representation.
Stage 0: contains two Swin blocks and maintains the original token resolution (96 channels).
Stage 1: applies PatchMerging, which concatenates each 2 × 2 group of tokens (4 × 96 = 384 channels) and projects them to 192, followed by two Swin blocks.
Stage 2: merges again, reducing 4 × 192 = 768 concatenated channels to 384, and applies six Swin blocks, increasing the model’s depth and representational capacity.
Stage 3: the final stage, reduces 4 × 384 = 1536 concatenated channels to 768 and includes two additional Swin blocks.
This hierarchical design allows the model to progressively build feature representations from fine-grained local textures to high-level semantic structures.
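The merging step between stages can be illustrated with a minimal PatchMerging layer, following the standard Swin formulation: each 2 × 2 neighborhood of tokens is concatenated (4C channels) and linearly projected to 2C, halving the resolution while doubling the depth. The dimensions here match the 96-channel embedding above; this is a sketch, not the paper’s implementation.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Halve token resolution, double channel depth (4C -> 2C projection)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):              # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]       # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]       # bottom-left
        x2 = x[:, 0::2, 1::2, :]       # top-right
        x3 = x[:, 1::2, 1::2, :]       # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)
```

Applied to a 96-channel map, this reproduces the Stage 1 transition (4 × 96 = 384 concatenated channels projected to 192).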
Our hybrid ResNet50-Swin Transformer model integrates the strengths of convolutional neural networks (CNNs) and Transformer architectures to enhance sewer defect classification, as depicted in Fig. 4. The process begins with raw CCTV images from sewer inspections, which are fed into a 2D convolution layer (Conv 2D). This initial step transforms the input image—typically a 3D tensor of shape (height, width, channels)—into a feature map with a reduced spatial dimension but increased depth, resulting in a tensor of shape (height/2, width/2, 512). Batch normalization (Batch Norm) and ReLU activation then stabilize and enhance these features, maintaining the same tensor shape while improving training efficiency. Next, the Max Pool layer reduces the spatial dimensions further by half, yielding a tensor of shape (height/4, width/4, 512). This is followed by a series of seven residual bottleneck blocks, which deepen the network while preserving the tensor’s depth at 512 channels; the spatial size remains (height/4, width/4), with residual connections facilitating gradient flow. The Patch Embedding layer then segments the feature map into patches, converting it into a tensor of shape (num_patches, 96), where num_patches depends on the image size, marking the transition to the Swin TransformerV2 component.
Fig. 4.
Overall framework of the hybrid ResNet50–Swin TransformerV2 model.
The Swin TransformerV2 stages process these patches through multiple layers. The first stage (PatchMerge Stage 0) merges patches, reducing the number of patches while increasing the feature depth to a tensor of shape (num_patches/2, 192). Subsequent stages (PatchMerge Stages 1, 2, and 3) continue this pattern, progressively halving the patch count and doubling the depth to (num_patches/4, 384), (num_patches/8, 768), and finally (num_patches/16, 768), respectively. Each stage includes Swin TransformerV2 blocks that refine features using windowed attention, maintaining the respective tensor shapes. As illustrated in Fig. 5, the Swin TransformerV2 Block consists of two primary branches: a Window-based Multi-Head Self-Attention (W-MHSA) mechanism and a feed-forward MLP block, both encapsulated by pre-layer normalization and residual connections. Let the input to the block be X ∈ ℝB×N×C, where B is the batch size, N denotes the number of tokens in each window (e.g., 7 × 7), and C is the embedding dimension. The input is first normalized using LayerNorm, and then projected into query (Q), key (K), and value (V) matrices using three learned linear transformations:
Q = XWQ,  K = XWK,  V = XWV    (2)
Fig. 5.

Structure of PatchEmbedding, PatchMerging, and a Swin TransformerV2Block and its components, including Window-based Multi-Head Self-Attention (MHSA) and MLP with residual connections.
Let X be the input tensor and WQ, WK, WV ∈ ℝC×C represent weight matrices for query, key, value, respectively. The self-attention mechanism computes raw similarity scores between query and key vectors using scaled dot-product attention:
Ah = (Qh Khᵀ) / √d    (3)
Here, h ∈ {1, …, H} indexes the attention head, and d = C/H is the dimension of each head. This design allows multiple attention heads to operate in parallel, each capturing distinct feature interactions. Unlike global attention, Swin TransformerV2 restricts these computations to local windows, which substantially reduces computational complexity from quadratic to linear with respect to image size. To retain spatial information lost due to local attention, Swin TransformerV2 integrates a continuous relative position bias mechanism. Instead of using fixed positional encodings, the model computes a relative positional offset Δpij = (xi-xj, yi-yj) ∈ ℝ2 between every pair of tokens i and j within a window and passes this vector through a two-layer MLP as:
Bij = W2 · ReLU(W1 Δpij + b1) + b2    (4)
Where W1 ∈ ℝ2×D, W2 ∈ ℝD×1 (D is the hidden dimension used within the positional MLP), and b1, b2 ∈ ℝD are the learnable parameters of this MLP. This bias is added to the raw attention scores, and the attention output for each head is computed as:
headh = SoftMax(Ah + B) Vh    (5)
Outputs from all heads are concatenated and projected through Wproj ∈ ℝ C×C to form the final attention result and a residual connection adds the Window-based Multi-Head Self-Attention (MHSA) output to the input:
X′ = X + W-MHSA(LN(X))    (6)
The second branch of the block is the feed-forward MLP, which further transforms the representation nonlinearly. The input X′ is first normalized, then passed through two linear transformations separated by a GELU (Gaussian Error Linear Unit) activation and dropout, and a second residual connection adds the result MLP(LN(X′)) back to X′ to produce the block output.
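Eqs. 2–6 for a single local window can be sketched compactly as follows. Window partitioning and the shifted-window scheme are omitted for brevity, and the hyperparameters (window size, bias-MLP width, class name) are illustrative assumptions rather than the paper’s exact configuration; the scaled dot-product and MLP-based relative position bias follow the formulation in the text.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """One window: pre-norm W-MHSA with MLP relative-position bias,
    then a pre-norm GELU MLP, each with a residual connection."""
    def __init__(self, dim, heads, window=7, bias_hidden=512, drop=0.0):
        super().__init__()
        self.h, self.d = heads, dim // heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)     # W_Q, W_K, W_V fused (Eq. 2)
        self.proj = nn.Linear(dim, dim)        # W_proj
        # Continuous relative position bias: MLP over (dx, dy) offsets (Eq. 4)
        self.bias_mlp = nn.Sequential(
            nn.Linear(2, bias_hidden), nn.ReLU(), nn.Linear(bias_hidden, heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window), torch.arange(window), indexing="ij"), -1)
        coords = coords.reshape(-1, 2).float()           # (N, 2) positions
        self.register_buffer("rel", coords[:, None] - coords[None, :])
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Dropout(drop),
            nn.Linear(4 * dim, dim), nn.Dropout(drop))

    def forward(self, x):                      # x: (B, N, C), N = window**2
        B, N, C = x.shape
        q, k, v = self.qkv(self.norm1(x)).reshape(
            B, N, 3, self.h, self.d).permute(2, 0, 3, 1, 4)  # (B, H, N, d) each
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5     # Eq. 3
        bias = self.bias_mlp(self.rel).permute(2, 0, 1)      # (H, N, N)
        attn = (attn + bias).softmax(dim=-1)                 # Eq. 5
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = x + self.proj(out)                               # Eq. 6
        return x + self.mlp(self.norm2(x))                   # MLP branch
```

The block preserves the token tensor shape, so it can be stacked freely within a stage.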
At the final stage of the Swin Transformer pipeline, the feature representations are first normalized using Layer Normalization (LayerNorm), then aggregated via Global Average Pooling, which reduces the spatial dimensions to a single value per channel, producing a compact feature vector of shape (1, 1, 768). To mitigate overfitting, a Dropout layer with a rate of 0.5 is applied. The resulting 768-dimensional vector is then passed through a fully connected (FC) layer, which maps it to 4 output neurons corresponding to the predefined sewer defect classes. This yields the final classification output of shape (1, 4), representing the logits for each class.
To train the model, we use the Cross-Entropy Loss implemented in PyTorch. This loss function measures the difference between the predicted class probabilities (after SoftMax) and the true class labels. For a single input sample, the loss is computed as Eq. 7:
L = −log( exp(zy) / ∑j=1C exp(zj) )    (7)
Where C, zj, and zy are the number of classes (in our case, 4), the logit (raw output) for class j, and the logit corresponding to the true class label y, respectively. This formulation encourages the model to assign higher scores to the correct class and penalizes incorrect predictions, thereby guiding the optimization during training. The complete architecture, illustrated in Fig. 4, combines the local feature extraction power of ResNet50 with the global context modeling strength of Swin TransformerV2, forming a robust hybrid model capable of accurately classifying subtle, low-contrast, and visually similar sewer pipe defects under complex imaging conditions.
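As a sanity check, Eq. 7 can be reproduced numerically and compared against PyTorch’s built-in loss. The logits below are toy values for a single sample over the four defect classes; the true class index is arbitrary.

```python
import torch
import torch.nn.functional as F

# Toy logits for one sample over C = 4 classes; true class y = 2.
z = torch.tensor([[1.0, -0.5, 2.5, 0.3]])
y = torch.tensor([2])

# Eq. 7 computed explicitly: L = -log(exp(z_y) / sum_j exp(z_j)),
# written via logsumexp for numerical stability.
manual = -(z[0, y] - torch.logsumexp(z[0], dim=0))

assert torch.allclose(manual, F.cross_entropy(z, y))
```

`F.cross_entropy` applies the SoftMax internally, which is why the model’s final FC layer outputs raw logits rather than probabilities.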
Modified YOLOv8 with attention mechanism for defect localization
To validate our choice of attention mechanism, we conducted comparative experiments with state-of-the-art attention modules including ECA (Efficient Channel Attention), CA (Coordinate Attention), and SE (Squeeze-and-Excitation). As detailed in the ablation study (Table 7), while CA achieves marginally higher mAP@50 (0.82), CBAM provides the optimal performance-efficiency trade-off with competitive accuracy (mAP@50: 0.81, mAP@50:95: 0.66) and superior computational efficiency (4.5 M parameters, 8.5G FLOPS, 210 FPS). This balance is crucial for real-time sewer inspection applications where both accuracy and deployment feasibility are essential.
Table 7.
Ablation study comparing the proposed modified YOLOv8n with CBAM to state-of-the-art object detection models on sewer defect detection.
| Model | Image size | Epoch | mAP@50 | mAP@50:95 | FPS | Params(M) | FLOPs(G) |
|---|---|---|---|---|---|---|---|
| Modified YOLOv8n + CBAM (Ours) | 640 × 640 | 200 | 0.81 | 0.66 | 210 | 4.5 | 8.5 |
| YOLOv8n + ECA | 640 × 640 | 200 | 0.8 | 0.64 | 220 | 4.2 | 7.9 |
| YOLOv8n + CA | 640 × 640 | 200 | 0.82 | 0.67 | 200 | 4.7 | 9.2 |
| YOLOv8n + SE | 640 × 640 | 200 | 0.79 | 0.63 | 195 | 4.8 | 9.5 |
| Faster R-CNN23 | 512 × 512 | 200 | 0.54 | 0.35 | 1 | 41.42 | 133.3 |
| YOLOv314 | 640 × 640 | 200 | 0.53 | 0.38 | 20 | 32.5 | 130.8 |
| YOLOX26 | 640 × 640 | 200 | 0.51 | 0.33 | 76 | 8.98 | 26.8 |
| Improved YOLOv5s46 | 640 × 640 | 200 | 0.61 | 0.45 | 143 | 7.1 | 15.9 |
| YOLOv727 | 640 × 640 | 200 | 0.65 | 0.48 | 115 | 6.04 | 13.1 |
| YOLOv8n28 | 640 × 640 | 200 | 0.76 | 0.57 | 230 | 3.2 | 4.3 |
| YOLOv9c47 | 640 × 640 | 200 | 0.86 | 0.71 | 160 | 25.5 | 102.8 |
| YOLOv10n47 | 640 × 640 | 200 | 0.74 | 0.55 | 240 | 2.3 | 6.8 |
To enhance the precision of sewer defect detection in challenging visual conditions, we customized the YOLOv8 by incorporating attention mechanisms—specifically the Channel Attention Module (CAM) and Spatial Attention Module (SAM). The rationale behind this design stems from the unique characteristics of sewer inspection imagery, where defects such as root intrusions, deposits, and open joints often exhibit low contrast, irregular shapes, and partial occlusion against highly textured or cluttered backgrounds. Conventional convolutional backbones may fail to adequately highlight these subtle features, resulting in suboptimal detection performance. By integrating CAM and SAM, inspired by the Convolutional Block Attention Module (CBAM)37, the model is guided to emphasize more informative channels and spatial regions, effectively suppressing irrelevant background noise and reinforcing discriminative patterns essential for defect localization. This attention-guided feature refinement enables the model to better capture spatial dependencies and context, which is critical in domains where the target structures are small and vary in morphology38. Consequently, the modified YOLOv8 architecture not only improves detection accuracy but also enhances the model’s robustness and generalization to diverse sewer environments, making it a more reliable tool for real-world infrastructure inspection tasks.
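A minimal CBAM, combining the channel (CAM) and spatial (SAM) attention modules described above, can be written as follows. This follows the original CBAM formulation37; the reduction ratio of 16 and the 7 × 7 spatial kernel are the common defaults from that paper, not necessarily the settings used in our modified YOLOv8.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention then spatial attention, applied sequentially."""
    def __init__(self, ch, reduction=16, kernel=7):
        super().__init__()
        # Shared MLP (as 1x1 convs) for the channel attention branch
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1, bias=False))
        # Single conv over stacked avg/max maps for spatial attention
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention: shared MLP over avg- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: conv over channel-wise avg and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

Because the module preserves the feature-map shape, it can be dropped in after any backbone stage (here, after the C2f blocks) without altering downstream layer dimensions.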
Backbone
The backbone of the proposed YOLOv8 architecture (Fig. 6) is optimized for extracting hierarchical and context-aware visual features from sewer inspection images. It follows a modular design consisting of the Convolutional Block (Conv), Cross-Stage Partial Bottleneck with Focus (C2f), Bottleneck modules, and a Spatial Pyramid Pooling – Fast (SSPF) block. To enhance its sensitivity to low-contrast and subtle defect features, a Convolutional Block Attention Module (CBAM) is embedded after the C2f stages. Each Conv block, shown in light blue in Fig. 6 (leftmost box), consists of a 2D convolution layer (Conv2d), followed by Batch Normalization and the SiLU activation function. This configuration stabilizes training, improves convergence, and ensures nonlinear feature extraction. Given an input tensor of shape [Cin, Hin, Win] (where C, H, and W represent the number of input channels, the input height, and the input width, respectively), the output shape becomes [Cout, Hout, Wout], where Cout denotes the number of output channels. The spatial dimensions Hout and Wout depend on the kernel size k, stride s, and padding p, and are computed using Eq. 839.
Fig. 6.
Conv, C2f, and SSPF blocks used in the backbone of the modified YOLOv8.
$$H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1, \qquad W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1 \tag{8}$$
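As a quick sanity check, the Conv output-shape rule above can be evaluated in a few lines of Python (an illustrative sketch, not code from the paper):

```python
import math

def conv_out(size: int, k: int, s: int, p: int) -> int:
    """Output spatial size of a Conv2d layer: floor((size + 2p - k) / s) + 1."""
    return math.floor((size + 2 * p - k) / s) + 1

# A typical YOLOv8 downsampling Conv (k=3, s=2, p=1) halves the input:
print(conv_out(640, k=3, s=2, p=1))  # 320
```

The same rule applies to each spatial dimension independently, so a 640 × 640 input to a stride-2 Conv yields a 320 × 320 feature map.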
The C2f (see Fig. 6) module begins with a Conv block that transforms the input tensor Xin into a feature map Xout ∈ ℝ^(C1 × H × W). This tensor is split into two equal branches along the channel dimension, each with shape [0.5C1, H, W]. One half bypasses the transformation layers, while the other is processed through a stack of Bottleneck modules with both shortcut-enabled (residual) and non-residual paths. Each Bottleneck consists of two Conv layers, and if shortcut = True, a residual connection is added. The output tensors, each still of shape [0.5C1, H, W], are concatenated along the channel axis to form a feature map of shape [0.5(n + 2)C1, H, W], which is then converted to [C2, H, W] through a final Conv block. The Bottleneck block (top middle in Fig. 6) has two paths: with or without shortcut connections. The residual version adds the input to the output after two Conv layers, maintaining the shape [C1, H, W], while the non-residual path only consists of two sequential Conv layers with the same input-output shape.
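The channel bookkeeping of the C2f split-and-concatenate path can be expressed as a small helper (a hypothetical function name, for illustration only):

```python
def c2f_channels(c1: int, n: int) -> int:
    """Channels after the C2f concatenation, before the final Conv.

    The input (c1 channels) is split into two halves of 0.5*c1; the bypass
    half, the processed half, and each of the n Bottleneck outputs (0.5*c1
    channels each) are concatenated: 0.5 * (n + 2) * c1 channels in total.
    """
    assert c1 % 2 == 0, "channel count must be even to split in half"
    return (n + 2) * c1 // 2

# Example: a 64-channel input through a C2f with 2 Bottlenecks:
print(c2f_channels(64, n=2))  # 128
```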
CBAM: Convolutional Block Attention Module Integration.
In our proposed method, the Convolutional Block Attention Module (CBAM) is integrated into YOLOv8 to enhance feature representation and improve defect detection performance. CBAM is composed of two sequential, lightweight, and computationally efficient sub-modules: Channel Attention (CA) and Spatial Attention (SA). The Channel Attention Module (CAM) refines feature responses by emphasizing informative channels. Applied to an input feature map of shape [C2, H, W], it aggregates spatial context through both global average pooling and global max pooling across the spatial dimensions37. The two pooled descriptors are passed through a shared multi-layer perceptron (MLP), implemented as two 1 × 1 convolution layers with a ReLU activation in between—first reducing the channel dimension from C2 to C2/r (with reduction ratio r = 16) to limit model complexity, and then restoring it to C2. The two MLP outputs are summed and passed through a sigmoid activation, and the resulting attention map is used to scale the input feature map channel-wise. The pooling operations are computed over all spatial positions (i, j) using Eq. 9.
$$F_{avg}^{c} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} F(c, i, j), \qquad F_{max}^{c} = \max_{i,j}\, F(c, i, j) \tag{9}$$
where i ∈ [1, …, H], j ∈ [1, …, W], and c ∈ [1, C2]; F(c, i, j) denotes the value of the input feature map at spatial location (i, j) in channel c. The resulting descriptors are two tensors of shape [C2, 1, 1]. Each descriptor is passed through the shared two-layer MLP (the FC layers in Fig. 7), and their outputs are summed. A sigmoid activation then generates a channel attention map of shape [C2, 1, 1], which is broadcast and multiplied element-wise with the original input tensor to rescale the channel-wise feature responses. The channel attention module is illustrated in Fig. 7.
Fig. 7.
Channel attention module.
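The channel-attention computation described above can be sketched in NumPy; the MLP weights W1 and W2 below are random stand-ins for the learned 1 × 1 convolutions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel attention over F of shape [C, H, W].

    W1: [C//r, C] and W2: [C, C//r] are the shared two-layer MLP weights
    (random stand-ins here; in the model they are learned parameters).
    """
    f_avg = F.mean(axis=(1, 2))                    # global average pool -> [C]
    f_max = F.max(axis=(1, 2))                     # global max pool -> [C]
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)   # FC -> ReLU -> FC
    attn = sigmoid(mlp(f_avg) + mlp(f_max))        # channel attention map [C]
    return F * attn[:, None, None]                 # broadcast over [C, H, W]

rng = np.random.default_rng(0)
C, r = 8, 4
F = rng.standard_normal((C, 6, 6))
W1 = rng.standard_normal((C // r, C)) * 0.1
W2 = rng.standard_normal((C, C // r)) * 0.1
out = channel_attention(F, W1, W2)
print(out.shape)  # (8, 6, 6)
```

Because the sigmoid map lies in (0, 1), the output is a per-channel rescaling of the input with the spatial resolution unchanged.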
Following CAM, the Spatial Attention Module (SAM) is employed to emphasize important spatial locations within each channel. Given the input tensor of shape [C2, H, W], SAM first computes the average and maximum values across the channel dimension at each spatial position (i, j) as shown in Eq. 10, yielding two feature maps of shape [1, H, W]:
$$F_{avg}(i, j) = \frac{1}{C_2}\sum_{c=1}^{C_2} F(c, i, j), \qquad F_{max}(i, j) = \max_{c}\, F(c, i, j) \tag{10}$$
These two maps are concatenated along the channel axis to form a [2, H, W] tensor, which is passed through a convolutional layer with a 7 × 7 kernel and padding of 3, followed by a sigmoid function. The resulting spatial attention map of shape [1, H, W] is then multiplied element-wise with the CAM-refined feature map. This operation adaptively boosts salient spatial regions while suppressing irrelevant background, preserving the overall tensor shape [C2, H, W] and improving the feature representation passed to the subsequent stage. The spatial attention module is shown in Fig. 8.
Fig. 8.
Spatial Attention Module.
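A minimal NumPy sketch of the spatial-attention step, using a random stand-in for the learned 7 × 7 kernel:

```python
import numpy as np

def spatial_attention(F, kernel):
    """Spatial attention over F of shape [C, H, W].

    kernel: [2, 7, 7] conv weights (random stand-ins; learned in the model).
    Zero padding of 3 keeps the [H, W] resolution, matching the 7x7 conv.
    """
    C, H, W = F.shape
    desc = np.stack([F.mean(axis=0), F.max(axis=0)])   # [2, H, W] descriptors
    padded = np.pad(desc, ((0, 0), (3, 3), (3, 3)))    # zero padding, p=3
    attn = np.empty((H, W))
    for i in range(H):                                 # naive 7x7 convolution
        for j in range(W):
            attn[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    attn = 1.0 / (1.0 + np.exp(-attn))                 # sigmoid -> (0, 1)
    return F * attn[None, :, :]                        # rescale each position

rng = np.random.default_rng(1)
F = rng.standard_normal((4, 8, 8))
kernel = rng.standard_normal((2, 7, 7)) * 0.05
out = spatial_attention(F, kernel)
print(out.shape)  # (4, 8, 8)
```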
As illustrated in Fig. 9, the CBAM module is sequentially composed of the Channel Attention Module (CAM) followed by the Spatial Attention Module (SAM), both encapsulated within a red-dashed boundary. The input feature map is first processed by CAM, which emphasizes informative channels via adaptive average and max pooling, shared MLP layers, and sigmoid scaling. The refined output is then forwarded to SAM, which highlights salient spatial locations by compressing channel-wise information and applying a convolution-based spatial attention map. The final output of the full CBAM block in the backbone retains the same spatial and channel dimensions as the input and is then passed to the Spatial Pyramid Pooling – Fast (SSPF) block. This positioning ensures that attention-refined features are delivered into the fusion layer, enhancing the model’s ability to represent critical visual cues before multi-scale integration.
Fig. 9.

Convolutional block attention module (CBAM).
At the final stage of the backbone, the SSPF block (Fig. 6, right box) aggregates multi-scale contextual information. A Conv layer first transforms the input into [C2, H, W], which is then passed through three successive MaxPool2d layers. These MaxPool layers preserve spatial resolution while progressively enlarging the receptive field. The input and the three pooled outputs, each of shape [C2, H, W], are concatenated (resulting in [4C2, H, W]) and passed through a final Conv block that reduces the channels from 4C2 back to C2, restoring the feature dimensionality. This structure captures both fine-grained and large-scale contextual cues, essential for detecting defects of varying sizes. As shown in Fig. 6, the backbone follows a consistent tensor transformation path that maintains spatial resolution when necessary and enriches semantic abstraction when progressing deeper into the network. This careful design ensures robust feature encoding while maintaining computational efficiency, which is especially important for real-time sewer defect detection applications20.
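The SSPF concatenation can be illustrated with a naive NumPy max-pooling that keeps the spatial resolution; the kernel size k = 5 is an assumption (the YOLOv8 default), and the final channel-reducing Conv is omitted:

```python
import numpy as np

def maxpool_same(x, k=5):
    """MaxPool2d with stride 1 and 'same' padding: shape [C, H, W] is kept."""
    p = k // 2
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

def sspf(x, k=5):
    """Concatenate x with three successively max-pooled copies: [4C, H, W].

    Each pooling pass widens the effective receptive field while preserving
    the spatial resolution; the channel-reducing 4C -> C Conv is omitted.
    """
    p1 = maxpool_same(x, k)
    p2 = maxpool_same(p1, k)
    p3 = maxpool_same(p2, k)
    return np.concatenate([x, p1, p2, p3], axis=0)

x = np.random.default_rng(2).standard_normal((3, 8, 8))
print(sspf(x).shape)  # (12, 8, 8)
```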
Neck
The proposed Neck module was designed to strengthen multi-scale feature fusion for improved defect detection performance. As illustrated in Fig. 10, the Neck receives three sets of feature maps: [C2, H, W] from the SSPF block and [C′, 2H, 2W], [C′′, 4H, 4W] from the outputs of the CBAM modules following the selected C2f blocks in the Backbone. The Neck follows a dual-path strategy to efficiently combine low-level spatial and high-level semantic information. In the top-down path, the deepest feature map from SSPF is successively upsampled and concatenated with the corresponding Backbone features at each scale. These fused maps are refined through C2f blocks, reducing the channel dimensions to limit redundancy and strengthen feature representation. The refined feature maps are then passed to the bottom-up path, where successive Conv blocks downsample the feature maps to their original scales. After each downsampling step, the feature maps are concatenated with outputs from the top-down path and further refined by C2f blocks, enhancing contextual consistency and improving feature localization.
Fig. 10.
Neck of the modified YOLOv8 (CBAM applied in all C2f blocks).
This bi-directional fusion mechanism, inspired by PANet and BiFPN principles, creates a robust multi-scale feature pyramid that preserves both fine-grained details and abstract semantics40. The consistent integration of C2f blocks throughout the fusion hierarchy facilitates improved gradient flow and promotes better detection accuracy, particularly for small or ambiguous defect patterns in sewer inspection imagery41.
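The tensor shapes flowing through the top-down path can be traced with a small bookkeeping sketch (the post-fusion C2f output channels here are illustrative assumptions, not the paper’s exact configuration):

```python
def topdown_shapes(deep, mid, shallow):
    """Shape bookkeeping for the top-down path of the Neck.

    Each feature is (channels, H, W). The deepest map is upsampled 2x and
    concatenated channel-wise with the next Backbone map; a C2f block is
    then assumed to squeeze the fused map back to that scale's channels.
    Returns the concatenated shapes at each fusion point.
    """
    fused = []
    prev = deep
    for feat in (mid, shallow):
        c, h, w = prev
        up = (c, h * 2, w * 2)                      # nearest 2x upsample
        cat = (up[0] + feat[0], feat[1], feat[2])   # concat along channels
        fused.append(cat)
        prev = feat                                 # C2f output (assumed)
    return fused

# SSPF output and two CBAM-refined Backbone maps (toy sizes):
print(topdown_shapes((256, 20, 20), (128, 40, 40), (64, 80, 80)))
# [(384, 40, 40), (192, 80, 80)]
```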
Head
The detection Head in the modified YOLOv8 (see Fig. 11) architecture predicts object bounding boxes, class probabilities, and objectness scores from the multi-scale feature maps produced by the Neck (Fig. 10). Each scale-specific feature map is processed independently by a lightweight series of convolutional layers to output dense, per-pixel predictions in an anchor-free format. The Head employs a decoupled design, where classification and regression branches operate separately after shared feature extraction. This structure minimizes task interference and leads to more stable and accurate convergence. The regression branch estimates the coordinates and dimensions of candidate bounding boxes, while the classification branch assigns confidence scores for predefined defect classes. During training, the total detection loss “Ldet” is computed as a weighted sum of box regression loss and classification loss, as shown in Eq. 11:
Fig. 11.
Detect block.
$$L_{det} = \lambda_{box} L_{box} + \lambda_{cls} L_{cls} \tag{11}$$
where λbox and λcls are the balancing coefficients. The box regression loss “Lbox” is calculated using the Complete Intersection over Union (CIoU) loss, defined in Eq. 12:
$$L_{box} = 1 - \mathrm{CIoU} \tag{12}$$
CIoU and IoU are defined in Eq. 13 and illustrated in Fig. 12.
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(B, B^*)}{c^2} - \alpha v, \qquad \mathrm{IoU} = \frac{|B \cap B^*|}{|B \cup B^*|} \tag{13}$$
Fig. 12.

Intersection over union.
where ρ²(B, B*) is the squared Euclidean distance between the box centers, c is the diagonal length of the smallest enclosing box covering both boxes, v is the aspect-ratio consistency term, and α is the trade-off parameter weighting v. Thus CIoU measures overlap together with center distance and aspect-ratio consistency, whereas IoU measures only the overlap between the predicted box and the ground-truth box, as shown in Fig. 12.
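The CIoU terms defined above can be computed directly; this is a generic sketch of the standard CIoU formulation, taking α = v / (1 − IoU + v) as in the original CIoU definition:

```python
import math

def iou(b1, b2):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def ciou(b1, b2, eps=1e-9):
    """CIoU = IoU - rho^2 / c^2 - alpha * v."""
    i = iou(b1, b2)
    # rho^2: squared distance between box centers
    cx1, cy1 = (b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2
    cx2, cy2 = (b2[0] + b2[2]) / 2, (b2[1] + b2[3]) / 2
    rho2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
    # c^2: squared diagonal of the smallest enclosing box
    ex1, ey1 = min(b1[0], b2[0]), min(b1[1], b2[1])
    ex2, ey2 = max(b1[2], b2[2]), max(b1[3], b2[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps
    # v: aspect-ratio consistency; alpha: its trade-off weight
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / (1 - i + v + eps)
    return i - rho2 / c2 - alpha * v

print(ciou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0 for identical boxes
```

For identical boxes all penalty terms vanish and CIoU equals IoU = 1; for disjoint boxes CIoU goes negative, which is what gives the loss a useful gradient even without overlap.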
The classification loss Lcls uses Binary Cross-Entropy (BCE) loss. After prediction, non-maximum suppression (NMS) is applied to eliminate redundant detections and retain the most confident results. This modular and efficient design allows the detection Head to maintain real-time performance while achieving high detection precision, particularly for small-scale and ambiguous defects commonly found in sewer pipelines.
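Greedy NMS, as applied after prediction, can be sketched as follows (a generic implementation, not the Ultralytics internals):

```python
def nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop overlaps above iou_thr.

    boxes: list of (x1, y1, x2, y2); returns indices of kept boxes.
    IoU is redefined locally so the sketch is self-contained.
    """
    def iou(b1, b2):
        ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
        ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
        a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
        return inter / (a1 + a2 - inter)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                       # most confident remaining box
        keep.append(best)
        order = [i for i in order                 # drop heavy overlaps with it
                 if iou(boxes[best], boxes[i]) < iou_thr]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```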
The overall architecture of the customized YOLOv8 model (Fig. 13) is optimized for sewer defect detection, addressing challenges like low contrast, noise, and irregular defect shapes. It processes input images of size [3, 640, 640] through three key components: Backbone, Neck, and Head. The Backbone integrates convolutional and C2f modules with CBAM attention blocks, enabling the model to focus on the most informative spatial and channel features while suppressing irrelevant noise. A Spatial Pyramid Pooling – Fast (SSPF) block at the end of the backbone enhances contextual understanding across multiple scales.
Fig. 13.
The overall architecture of modified YOLOv8.
The Neck employs a bi-directional feature fusion path, combining feature maps from different depths using Concat, Upsample, and C2f + CBAM modules. This fusion strengthens both spatial detail and semantic abstraction, enhancing the model’s ability to detect diverse defect types. In the Head, the architecture uses three parallel detection branches, each specialized for predicting small, medium, and large objects, ensuring robust multi-scale localization. After prediction, Non-Maximum Suppression (NMS) is applied to eliminate redundant detections. Overall, the integration of CBAM and multi-scale prediction significantly improves detection accuracy and reliability in challenging sewer inspection environments.
Data preparation
To develop a robust AI-based sewer defect detection system, we conducted approximately 200 robot-assisted CCTV sewer inspections across various regions in Iran. The data collection process was executed under realistic, working sewer conditions, which were highly erosive, humid, and poorly illuminated, making high-quality image acquisition particularly challenging. Despite these difficulties, robotic navigation and imaging flexibility enabled the capture of diverse and informative visual data essential for detecting structural anomalies. The imaging platform employed was the Wöhler VIS 700 HD, a widely adopted sewer inspection camera, featuring a 1/4” Sony CCD sensor, 480 TVL resolution, 360° pan and 180° tilt capability, and robust waterproofing rated at IP68. The camera was mounted on a crawler system capable of operating on 4 or 6 wheels, climbing inclines of 30° to 45°, and functioning within a − 20 °C to 55 °C temperature range. These specifications allowed comprehensive coverage of internal pipeline structures across 8-inch and 10-inch polyethylene sewer pipes, from various angles and depths.
The selection of three primary defect categories (root intrusions, deposits, and open joints) is based on a comprehensive analysis of maintenance records provided by the Tehran Water and Wastewater Company. Statistical data from 2018 to 2023 indicate that these three defect types together constitute approximately 87% of all structural issues reported in polyethylene sewer systems across Iran. This distribution reflects the specific characteristics of polyethylene pipe infrastructure, where traditional concrete pipe defects such as cracking, structural collapse, and corrosion are significantly less prevalent due to the material’s flexibility and chemical resistance. The remaining 13% of defects, including rare occurrences of pipe fractures or severe structural damage, appeared too infrequently in our dataset (< 50 instances) to enable robust machine learning model development.
An initial dataset of approximately 10,000 raw CCTV images was collected. However, due to the difficult operational conditions, a significant portion of the data exhibited artifacts such as motion blur, occlusions, or irrelevance to the defect classes of interest. To ensure quality and relevance, we applied a comprehensive multi-stage preprocessing pipeline designed to balance data quality with operational deployment conditions.
Quality assessment and preprocessing pipeline
The preprocessing methodology was specifically designed to preserve challenging but realistic inspection scenarios while removing unusable data. Our quality assessment criteria included:
Motion Blur Assessment: Images with excessive motion blur (Laplacian variance < 100) that completely obscured structural details were removed, while those with moderate blur that retained visible pipe features were preserved to maintain dataset realism and robustness.
Contrast and Illumination Evaluation: Rather than eliminating all low-contrast images, we applied adaptive histogram equalization (CLAHE) selectively to enhance visibility while preserving the challenging nature of poorly illuminated conditions commonly encountered in sewer inspections. Images with contrast ratios above 0.15 were retained regardless of illumination quality.
Defect Visibility Verification: Images were included if defects remained identifiable to expert human inspectors, even under suboptimal visual conditions, ensuring the dataset reflects the full spectrum of inspection scenarios that automated systems must handle in practice.
Annotation Quality Control: All retained images underwent expert validation with two independent annotators to ensure accurate labeling despite challenging visual conditions.
This approach intentionally preserves challenging characteristics inherent in real-world sewer inspection imagery, including low contrast, uneven illumination, moderate blur, and environmental noise. Unlike controlled laboratory datasets, our preprocessing strategy maintains the difficult conditions that previous automated systems have struggled to address effectively, providing a more rigorous evaluation environment that accurately reflects operational deployment scenarios.
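The Laplacian-variance blur criterion above can be reproduced with plain NumPy (a sketch of the standard metric; the threshold of 100 is as stated in the text and assumes 0–255 grayscale inputs):

```python
import numpy as np

def laplacian_variance(img):
    """Variance of the Laplacian of a grayscale image [H, W].

    Applies the standard 3x3 Laplacian kernel with edge padding; a low
    variance indicates few sharp edges, i.e. a blurry frame.
    """
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    H, W = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    lap = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            lap[i, j] = np.sum(padded[i:i + 3, j:j + 3] * k)
    return lap.var()

flat = np.full((32, 32), 128)        # featureless frame: no edges at all
print(laplacian_variance(flat))      # 0.0 -> far below 100, rejected as blurry
```

In practice the same metric is commonly computed with `cv2.Laplacian(img, cv2.CV_64F).var()`; the explicit loop here simply makes the computation transparent.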
The resulting curated dataset comprises 6,912 images manually categorized into four classes: root intrusions, deposits, open joints, and normal (no defect). These images were specifically selected to represent the challenging visual conditions encountered in actual sewer inspections, including scenarios with poor lighting, low contrast, and moderate image degradation that are typical of underground infrastructure environments. An overview of these classes is illustrated in Fig. 14, showcasing representative appearances from the dataset under various challenging conditions.
Fig. 14.
Different classes of our dataset.
For the ResNet50–Swin Transformer classification model, images were resized to 224 × 224 pixels and normalized channel-wise to ensure stable learning and faster convergence. Instead of using fixed training, validation, and test splits, the entire dataset was evaluated using 5-fold stratified cross-validation to ensure robust performance estimation and preserve class balance across all folds. The distribution of images per class is detailed in Table 1. For the modified YOLOv8 detection model, only defective images (excluding the “normal” class) were retained and labeled with bounding boxes corresponding to each defect type. We used CVAT (Computer Vision Annotation Tool) for annotation, enabling precise and structured labeling compatible with the YOLO format. Images were resized to 640 × 640 pixels, as required by the YOLOv8 input standard, and split into training, validation, and test sets using a 70/15/15 ratio. The distribution is shown in Table 2. This diverse and carefully prepared dataset—characterized by variable lighting conditions, pipe textures, defect shapes, camera perspectives, and deliberately retained challenging visual conditions—provides a robust foundation for both accurate classification and precise localization under real-world operational constraints, facilitating the development of a scalable AI solution for automated sewer inspection that maintains reliability across diverse deployment scenarios.
Table 1.
Number of images in each class used for hybrid ResNet50-Swin Transformer.
| Defect class | Train/Validation | Test |
|---|---|---|
| Root intrusion | 1450 | 290 |
| Deposit | 1530 | 306 |
| Open joint | 1580 | 316 |
| Normal (No Defect) | 1200 | 240 |
| Total | 5760 | 1152 |
Table 2.
Number of images in each category used for training, validation, and testing of the modified YOLOv8.
| Defect Class | Training | Validation | Test |
|---|---|---|---|
| Root Intrusion | 596 | 145 | 150 |
| Deposit | 500 | 153 | 144 |
| Open Joint | 676 | 150 | 148 |
| Total | 1772 | 442 | 442 |
Experiments and results
All experiments were conducted using Google Colab Pro, utilizing NVIDIA Tesla T4 GPU (16GB VRAM), dual Intel Xeon CPUs @ 2.20 GHz, 25GB RAM, and SSD storage. This platform choice demonstrates the accessibility of our approach, as the computational requirements remain within the capabilities of widely available cloud computing resources.
Training Efficiency on Colab Pro:
Hybrid classification model: 4–6 h per cross-validation fold.
Modified YOLOv8 detection: 4–5 h for 200 epochs.
Memory utilization: Peak 8GB during training, 2GB during inference.
Session management: Training distributed across multiple sessions due to platform timeout limitations.
Classification result
The classification stage of the proposed framework employs a transfer learning strategy using a hybrid ResNet50–Swin TransformerV2 model. The ResNet50 backbone, pre-trained on ImageNet, is responsible for local feature extraction, while Swin TransformerV2 complements it by capturing global contextual relationships. In the hybrid ResNet50–Swin Transformer architecture, the original classification layers of ResNet50 were removed and replaced with a custom classification head consisting of a flatten layer, a dropout layer with a rate of 0.5 to reduce overfitting, and a dense output layer with softmax activation to predict the four sewer defect classes. Input images were resized to 224 × 224 pixels, and training was performed with a batch size of 16 on Google Colab Pro. All model development and training were carried out using PyTorch version 2.1. To preserve transferable low-level features, the model was partially fine-tuned by freezing the early convolutional layers—specifically, all parameters in layer.0 (Layer1) and layer.1 (Layer2) of ResNet50. These correspond to 7 bottleneck blocks (21 convolutional layers), which remained fixed, while the deeper layers (layer.2 and layer.3) and the transformer stages were trainable, allowing the model to adapt higher-level representations to the sewer inspection domain.
To improve generalization, on-the-fly data augmentation was applied to the training set. This included random horizontal flips, random rotations up to ± 10°, and random resized cropping (scale: 0.9–1.0). Images were normalized using the ImageNet mean and standard deviation. For validation, only resizing and normalization were applied to ensure consistency in evaluation. The model was optimized using the Adam optimizer with an initial learning rate of 0.001. A ReduceLROnPlateau scheduler adaptively reduced the learning rate by a factor of 0.5 when the validation loss plateaued. Additionally, early stopping with a patience of 10 epochs prevented overfitting and reduced training time. The best model weights were saved using the ModelCheckpoint callback based on validation accuracy.
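The scheduling and early-stopping rules can be replayed on a synthetic loss history (a pure-Python sketch of the logic; the actual training uses torch.optim.lr_scheduler.ReduceLROnPlateau, and the scheduler patience of 3 epochs is an illustrative assumption):

```python
def train_schedule(val_losses, lr=0.001, factor=0.5, lr_patience=3,
                   stop_patience=10):
    """Replay the LR-plateau and early-stopping rules on a loss history.

    Returns (epochs_run, final_lr). The LR is halved after lr_patience
    consecutive epochs without improvement; training stops entirely after
    stop_patience epochs without a new best validation loss.
    """
    best, since_best, since_lr_drop = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best, since_lr_drop = loss, 0, 0
        else:
            since_best += 1
            since_lr_drop += 1
            if since_lr_drop > lr_patience:     # plateau -> halve the LR
                lr *= factor
                since_lr_drop = 0
            if since_best >= stop_patience:     # early stopping triggers
                return epoch + 1, lr
    return len(val_losses), lr

# Loss improves for 3 epochs, then plateaus -> training stops early:
history = [1.0, 0.8, 0.7] + [0.75] * 12
print(train_schedule(history))
```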
The classification performance of the proposed Hybrid ResNet50–Swin Transformer model was thoroughly evaluated using 5-fold stratified cross-validation during training, followed by final testing on a separate, fixed test set. This approach ensured that the entire dataset contributed to robust model training and validation, while the test set remained completely unseen during development, enabling an unbiased assessment of generalization capability. The confusion matrices across all five folds are illustrated in Fig. 15, demonstrating consistent behavior across different data partitions. Standard evaluation metrics—precision, recall, and F1-score—were used to quantify classification performance across all folds in Table 3. After selecting the final model, its performance was evaluated on the fixed test set, with results presented in Fig. 16 (confusion matrix) and Table 4 (per-class metrics). The model achieved strong performance across all four sewer defect categories: deposit, root intrusion, open joint, and normal (defect-free).
Fig. 15.
Confusion matrices across 5-fold cross-validation.
Fig. 16.

Confusion matrix on testing set.
Table 3.
Average performance metrics across folds for defect classification.
| Class | Precision% | Recall% | F1-Score% | Accuracy% |
|---|---|---|---|---|
| Deposit | 90.50 | 92.07 | 91.28 | |
| Normal | 93.33 | 91.91 | 92.61 | |
| Open joint | 93.26 | 94.41 | 93.83 | |
| Root intrusion | 92.66 | 90.92 | 91.77 | |
| Overall | | | | 92.49 |
Table 4.
Precision, recall, F1-score, and accuracy on the test set.
| Class | Precision% | Recall% | F1-Score% | Accuracy% |
|---|---|---|---|---|
| Deposit | 92.76 | 92.16 | 92.46 | |
| Normal | 92.51 | 92.51 | 92.50 | |
| Open joint | 94.88 | 93.98 | 94.43 | |
| Root intrusion | 90.84 | 92.41 | 91.62 | |
| Overall | | | | 90.28 |
To provide interpretability analysis of our classification models, we implemented Grad-CAM visualization (Fig. 17) to compare our hybrid ResNet50-Swin Transformer architecture against its individual components. Since our hybrid model integrates ResNet50 for local feature extraction with Swin Transformer for global contextual modeling, we trained and evaluated three separate architectures: standalone ResNet50, standalone Swin Transformer, and the complete hybrid system.
Fig. 17.
Grad-CAM visualization comparison showing attention patterns across three independently trained models: ResNet50, Swin Transformer, and hybrid ResNet50-Swin Transformer for representative samples of each defect class. Different architectures demonstrate distinct attention mechanisms for defect localization.
The Grad-CAM visualizations reveal distinct attention mechanisms across architectures that validate our fusion strategy. ResNet50 demonstrates focused spatial attention with sharp localization on defect regions, effectively capturing local textural patterns and boundaries. The Swin Transformer exhibits distributed contextual attention that captures broader spatial relationships but with less precise localization. Our hybrid ResNet50-Swin Transformer combines these complementary strengths, showing refined attention patterns that maintain spatial precision while incorporating enhanced contextual awareness from the global reasoning capabilities.
This comparative analysis demonstrates that the hybrid architecture successfully integrates the local feature extraction strengths of ResNet50 with the global contextual modeling of Swin Transformer, producing attention patterns that neither component achieves independently. These visualization results provide interpretability evidence supporting our quantitative findings, where the hybrid model achieved 90.28% accuracy compared to 67.89% and 75.75% for standalone ResNet50 and Swin Transformer, respectively, confirming that the architectural fusion delivers measurable performance improvements beyond what individual components can accomplish.
Defect localization results
The modified YOLOv8 model was trained and evaluated on Google Colab Pro, utilizing the same hardware resources described earlier. Training was conducted in a Windows-based environment using Python 3.10 and PyTorch 2.1 with CUDA 12.1 for GPU acceleration. The model underwent fine-tuning for 200 epochs using official pre-trained weights, with all layers unfrozen to enable comprehensive end-to-end optimization. Training utilized a batch size of 16 and a learning rate of 0.01. All other hyperparameters were preserved from the original YOLOv8 configuration to ensure consistent and robust benchmarking. Figure 18 displays the training and validation loss curves for the modified YOLOv8 model in detecting sewer defects—specifically open joints, root intrusions, and deposits. The four plots represent bounding box regression (box loss) and classification (cls loss) losses across 200 epochs. The x-axis denotes epochs, and the y-axis shows the corresponding loss values. Both training and validation losses demonstrate a steady downward trend, indicating effective convergence. The smoothed loss curves further emphasize the model’s stable learning and absence of overfitting, supporting its reliability in real-world defect detection tasks.
Fig. 18.
Train/Val losses over training process.
To evaluate the detection performance of the proposed system, we employed the mean Average Precision (mAP) metric, which integrates the Average Precision (AP) across individual defect classes—open joints, root intrusions, and deposits—based on the area under their respective precision-recall curves.
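AP as the area under a precision-recall curve, and mAP as the mean of per-class APs, can be written out directly (a simplified sketch; the reported values come from the standard YOLO evaluation tooling):

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve via precision * recall-increment.

    Assumes recalls are sorted ascending; this is the plain AP integral,
    one of several closely related AP variants used in practice.
    """
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Per-class APs combine into mAP by a simple mean; using the baseline
# per-class values reported in the text:
aps = {"open_joint": 0.84, "root_intrusion": 0.74, "deposit": 0.53}
print(round(sum(aps.values()) / len(aps), 2))  # matches the baseline mAP of ~0.70
```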
As shown in Fig. 19 (b), when using the modified YOLOv8 model alone, the detection achieved AP values of 0.84 (open joints), 0.74 (root intrusions), and 0.53 (deposits), resulting in a baseline mAP of 0.70 at an Intersection over Union (IoU) threshold of 0.5. In contrast, Fig. 19 (a) demonstrates the substantial performance gain achieved by incorporating a hybrid ResNet50–Swin Transformer classifier as a filtering stage prior to detection. This first-stage classifier accurately distinguishes defective from non-defective images with 90.28% accuracy, effectively reducing false positives and minimizing irrelevant input to the detector.
Fig. 19.
Precision-Recall (P-R) curves of the modified YOLOv8 on validation set: (a) YOLOv8n + CBAM, (b) YOLOv8n without CBAM.
As a result of this two-stage pipeline, the modified YOLOv8 model receives only high-confidence true positive (TP) samples, improving its class-specific APs to 0.92 (open joints), 0.87 (root intrusions), and 0.65 (deposits), and boosting the overall mAP to 0.813—an 11% improvement over the single-stage setup. This clear gain underlines the value of the filtering mechanism in enhancing precision and reducing the computational burden on the object detection module. Additionally, the YOLOv8 model integrates Convolutional Block Attention Modules (CBAM), further refining its sensitivity to subtle features and ambiguous defect boundaries. This is especially beneficial for challenging cases like deposits, which often appear in low-contrast or noisy imaging conditions. The results are validated over a comprehensive dataset collected from over 200 polyethylene sewer pipelines across Iran.
In the modified YOLOv8 model, we evaluated the defect localization performance across three critical defect categories: root intrusion, deposit, and open joint, along with a background class to capture non-defective regions. The model demonstrated strong detection capabilities, achieving mean Average Precision (AP) scores of 0.92 for open joints, 0.87 for root intrusions, and 0.65 for deposits at an Intersection-over-Union (IoU) threshold of 0.5. As illustrated in Fig. 20, the normalized confusion matrix on test set highlights the model’s high reliability in detecting open joints (85%) and root intrusions (76%), while the deposit class showed notably lower accuracy (58%) and considerable confusion with both the background and root classes.
Fig. 20.

Normalized confusion matrix of the modified YOLOv8 model evaluated on the test set.
This discrepancy is primarily attributed to the fact that deposit features often lack distinct visual boundaries, making them partially invisible or ambiguous in low-contrast environments. Moreover, in many cases, old or degraded root intrusions that have spread thinly across the pipe surface tend to be misinterpreted as deposits, leading to further misclassifications. Despite these challenges, the model maintains reasonably strong localization performance, reinforcing the effectiveness of the attention-enhanced YOLOv8 architecture in detecting diverse sewer pipe defects under complex conditions. These results support the viability of the proposed system as a reliable tool for automated and efficient sewer infrastructure inspection.
A two-stage pipeline enhanced defect localization accuracy in sewer inspection images by integrating a hybrid ResNet50–Swin Transformer classifier with a modified YOLOv8 detector. Initially, the ResNet50–Swin Transformer model was employed to classify images as defective or non-defective, and only those identified as defective were forwarded to the YOLOv8 model for localization. This strategy significantly reduced the detection noise and focused the localization process on relevant instances, leading to improved performance. As shown in the normalized confusion matrix (Fig. 21), this approach yielded higher true positive rates compared to using YOLOv8 alone. Specifically, the model achieved normalized true positive rates of 0.90 for Open Joint, 0.87 for Root Intrusion, and 0.67 for Deposit. These improvements confirm that leveraging high-confidence classifications from the hybrid model effectively guided the detector, particularly in minimizing false positives for background and improving localization focus. The results highlight the advantage of combining classification and detection in a targeted pipeline, making it a promising solution for reliable, real-time defect localization in complex sewer environments.
Fig. 21.

Normalized confusion matrix based on true positives from the testing set of the Hybrid ResNet50–Swin Transformer model for sewer defect classification.
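The gating logic of the two-stage pipeline can be sketched as follows. Note that `classify` and `detect` here are hypothetical stand-ins for the hybrid ResNet50–Swin Transformer and the modified YOLOv8 forward passes, not the actual model interfaces used in this study.

```python
# Illustrative sketch of the two-stage pipeline: the classifier gates which
# frames ever reach the detector, so non-defective frames cannot produce
# false-positive boxes.

def classify(image):
    # Stand-in for the ResNet50-Swin Transformer classifier; here we fake
    # the decision with a flag carried on the image record.
    return "defective" if image.get("has_defect_features") else "normal"

def detect(image):
    # Stand-in for the modified YOLOv8 detector: returns a list of boxes.
    return [{"cls": "open_joint", "conf": 0.90, "box": (10, 20, 50, 60)}]

def two_stage_inspect(frames):
    """Run the detector only on frames the classifier flags as defective."""
    results = []
    for frame in frames:
        if classify(frame) == "defective":                 # stage 1: filter
            results.append((frame["id"], detect(frame)))   # stage 2: localize
        else:
            results.append((frame["id"], []))              # detector skipped
    return results

frames = [
    {"id": 0, "has_defect_features": True},
    {"id": 1, "has_defect_features": False},  # never reaches the detector
]
out = two_stage_inspect(frames)
```

Because frame 1 is filtered out in stage 1, it can contribute neither true nor false detections, which is exactly how the pipeline suppresses background false positives.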
The integration of a modified YOLOv8 model with the hybrid ResNet50–Swin Transformer significantly enhances the overall defect detection pipeline by combining high classification accuracy with precise spatial localization. In the initial evaluation, YOLOv8 alone achieved a mean Average Precision (mAP@0.5) of 0.70 and mAP@[0.50:0.95] of 0.54 on the test set after 200 epochs, demonstrating strong performance in localizing sewer defects across various classes and conditions. However, when YOLOv8 was applied exclusively to the true positive images identified by the ResNet50–Swin Transformer classifier, the model’s performance improved noticeably, reaching an mAP@0.5 of 0.81 and an mAP@[0.50:0.95] of 0.66. This gain highlights the benefit of a two-stage strategy, in which the classifier effectively filters out non-defective images, allowing YOLOv8 to focus solely on relevant regions with higher localization precision. The improvement underscores a synergistic relationship between classification and detection, where the ResNet50–Swin Transformer’s robust feature-based classification boosts YOLOv8’s localization capability. The performance comparison is summarized in Table 5.
Table 5.
Mean average precision (mAP) of the modified YOLOv8 on the test set and on hybrid ResNet50–Swin Transformer true positives.
| Dataset | mAP@50 | mAP@50:95 | Epoch |
|---|---|---|---|
| Test set | 0.70 | 0.54 | 200 |
| ResNet50–Swin Transformer TPs Only | 0.81 | 0.66 | 200 |
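A quick arithmetic check on the Table 5 figures: the classifier-filtered set lifts mAP@50 by 0.11 absolute points, which is roughly a 16% relative improvement over the unfiltered baseline.

```python
# Gain implied by Table 5: filtering with the classifier lifts mAP@50
# from 0.70 (raw test set) to 0.81 (classifier true positives only).
map_test, map_filtered = 0.70, 0.81
absolute_gain = map_filtered - map_test      # gain in mAP points
relative_gain = absolute_gain / map_test     # gain relative to the baseline
print(f"+{absolute_gain:.2f} mAP points ({relative_gain:.1%} relative)")
```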
Figure 22 illustrates three specific examples where integrating the hybrid ResNet50–Swin Transformer classifier with the modified YOLOv8 detector significantly improves detection accuracy by filtering out false positives before localization. In case (a), smooth water flow under sewer inspection conditions is incorrectly identified as “Deposit” by YOLOv8, although no actual defect is present. In case (b), a simple flow pattern within the pipe is mistakenly classified as an “Open Joint” due to the model’s bias toward similar geometric patterns. In case (c), a foreign object hanging from a side inlet, which visually resembles a tree root, is misclassified as “Root Intrusion” despite being materially different. With the proposed two-stage approach, which first applies the ResNet50–Swin Transformer classifier to filter out normal (non-defective) images and then runs the modified YOLOv8 on only the true positive outputs, such misclassifications can be significantly reduced. This highlights a major advantage of the hybrid framework introduced in this study.
Fig. 22.
Examples demonstrating the effectiveness of the proposed two-stage defect detection pipeline. Each row shows: (left) the ground truth image, (middle) the correct “normal” classification by the ResNet50–Swin Transformer classifier, and (right) incorrect defect detection results by the modified YOLOv8 model.
Figure 23 presents a comparative evaluation of the baseline YOLOv8 model and the proposed modified YOLOv8 model, which integrates an attention mechanism and an anchor-free architecture. The modified model was assessed on the test dataset to validate its effectiveness in defect detection under real-world sewer inspection conditions. Results indicate a clear performance advantage of the modified YOLOv8 over its baseline counterpart, particularly in complex environments characterized by low illumination, shadows, poor contrast, and irregular defect morphology. The enhanced model demonstrates significantly improved capability in identifying and localizing defects such as root intrusions, deposits, and open joints, with consistently higher confidence scores. This improvement is visually evident through tighter and more accurate bounding boxes, which more closely align with the ground truth annotations, especially in challenging detection scenarios. In contrast, the baseline model frequently fails to detect true positives under similar conditions, often generating imprecise or incomplete localizations.
Fig. 23.
The ground truths and the prediction results by YOLOv8 and Modified YOLOv8 for sample images.
The superior performance of the modified YOLOv8 model can be attributed to two key enhancements. First, the anchor-free mechanism enables dynamic adjustment of bounding box predictions, allowing the model to better adapt to multi-scale and variably shaped defects without relying on predefined anchor templates. This flexibility is particularly beneficial in handling the wide spatial variability of defects in sewer pipelines. Second, the integration of an attention module enhances the model’s ability to focus on salient spatial features while suppressing background noise. This leads to improved detection robustness in low-visibility areas or in cases where defects are partially occluded or distant from structural elements such as joints. These enhancements collectively reduce false negatives and improve the precision of defect localization, resulting in confidence scores frequently exceeding 0.80 in complex inspection environments. The close alignment between predicted bounding boxes and annotated ground truth in Fig. 23 validates the model’s increased sensitivity and specificity. Overall, this improved architecture offers a scalable and reliable solution for automated infrastructure inspection, effectively addressing the visual complexity inherent in real-world sewer defect detection tasks.
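To make the attention mechanism concrete, the sketch below shows what a CBAM block computes on a feature map. The weights are random placeholders, and the spatial branch replaces CBAM’s learned 7×7 convolution with a fixed average of the pooled maps, so this illustrates the mechanism only; it does not reproduce the trained module inserted into YOLOv8.

```python
import numpy as np

# Minimal NumPy sketch of a CBAM-style block (Woo et al., 2018): channel
# attention followed by spatial attention, each producing multiplicative
# weights in (0, 1) that re-scale the input feature map.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, reduction=2):
    # x: (C, H, W). A shared two-layer MLP scores avg- and max-pooled
    # channel descriptors; weights here are random stand-ins.
    c = x.shape[0]
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    att = sigmoid(w2 @ np.maximum(w1 @ avg, 0.0)
                  + w2 @ np.maximum(w1 @ mx, 0.0))  # (C,)
    return x * att[:, None, None]                   # re-weight channels

def spatial_attention(x):
    # Channel-wise avg and max maps; real CBAM fuses them with a learned
    # 7x7 conv -- we use their plain average as a placeholder.
    avg = x.mean(axis=0)                            # (H, W)
    mx = x.max(axis=0)                              # (H, W)
    att = sigmoid((avg + mx) / 2.0)                 # (H, W)
    return x * att[None, :, :]                      # re-weight positions

def cbam(x):
    return spatial_attention(channel_attention(x))

feat = rng.standard_normal((8, 16, 16))             # dummy (C, H, W) features
out = cbam(feat)
```

The key property is that both branches only re-scale the input, so the block preserves tensor shape and can be dropped into an existing backbone stage.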
Discussion
To evaluate the robustness and generalization capability of our proposed hybrid classification model in recognizing sewer pipe defects, we conducted a comprehensive comparison against several recent deep learning architectures applied to our dataset. As shown in Table 6, we benchmarked our approach against a range of established models, including conventional CNN-based networks (e.g., ResNet50, VGG16, MobileNetV2, InceptionV3) as well as transformer-based and hybrid approaches. All models were trained and tested on the same dataset split to ensure a fair and consistent evaluation.
Table 6.
Ablation study comparing the proposed hybrid ResNet50–Swin Transformer model with state-of-the-art CNN and transformer architectures on sewer defect classification.
| Model | Precision% | Recall% | F1-Score% | Accuracy | Params (M) | Model size (MB) | Latency (ms) |
|---|---|---|---|---|---|---|---|
| Hybrid ResNet50-Swin Transformer (Ours) | 92.74 | 92.76 | 92.75 | 90.28 | 29.98 | 114.5 | 12.0 |
| ResNet5042 | 71.25 | 69.77 | 70.68 | 67.89 | 24.0 | 91.51 | 8.5 |
| Swin Transformer | 78.21 | 77.05 | 78.47 | 75.75 | 28.34 | 108.14 | 10.5 |
| MobileNetV243 | 68.39 | 61.64 | 64.46 | 58.45 | 2.28 | 8.63 | 4.5 |
| InceptionV344 | 72.38 | 70.55 | 69.98 | 68.0 | 21.83 | 83.19 | 7.5 |
| Kumar et al.11 | 76.23 | 73.75 | 75.12 | 72.50 | 269.55 | 22.0 | |
| VGG1645 | 81.2 | 79.87 | 79.5 | 79.0 | 20.02 | 98.39 | 16.0 |
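The hybrid model’s row in Table 6 is internally consistent: the reported F1-score is the harmonic mean of the reported precision and recall, which can be verified directly.

```python
# F1 is the harmonic mean of precision and recall. For the hybrid model's
# Table 6 row (92.74% precision, 92.76% recall) this reproduces the
# reported F1-score of 92.75%.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(92.74, 92.76), 2))  # -> 92.75
```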
Our Hybrid ResNet50–Swin Transformer achieved the best overall performance, with 92.74% precision, 92.76% recall, 92.75% F1-score, and 90.28% accuracy, significantly outperforming the compared models. For example, ResNet50 and Swin Transformer alone reached only 67.89% and 75.75% accuracy, respectively, while lightweight MobileNetV2 achieved 58.45%. Even deeper CNNs such as InceptionV3 (68.0%) and VGG16 (79.0%) fell well short of our model’s performance. The improvements can be attributed to the integration of ResNet50’s residual learning for capturing fine local texture variations and the Swin Transformer’s hierarchical attention for modeling long-range dependencies. This synergy allows the hybrid model to effectively generalize across defects of varying scales, shapes, and appearances—conditions that are common in sewer inspection footage. In contrast, baseline models often struggled under degraded lighting or noisy environments.
The ablation study presented in Table 6 highlights the superior performance of the proposed hybrid ResNet50–Swin Transformer model for sewer defect classification, which attains 92.74% precision, 92.76% recall, a 92.75% F1-score, and 90.28% accuracy with 29.98 M parameters and a 114.5 MB model size. In comparison, traditional CNN architectures such as ResNet50 (67.89% accuracy), MobileNetV2 (58.45%), InceptionV3 (68.0%), Kumar et al. (72.50%), and VGG16 (79.0%) fall significantly short, with lower metric scores and varying parameter counts. This demonstrates the hybrid model’s enhanced capability to leverage the strengths of both ResNet50 and the Swin Transformer, offering a robust solution for accurate defect classification while maintaining a reasonable parameter budget.
The effectiveness of the proposed Modified YOLOv8n with CBAM is demonstrated through an ablation study summarized in Table 7, which compares its performance against several state-of-the-art object detection models for sewer defect detection. As shown in the table, our modified model achieves the highest mAP@50 of 0.81 and mAP@50:95 of 0.66, outperforming the original YOLOv8n (0.76 and 0.57, respectively) and significantly surpassing models such as Faster R-CNN, YOLOv3-SPP, YOLOv5s, and YOLOv7-tiny. Importantly, this performance gain comes with only a moderate increase in parameters (4.5 M) and maintains a high inference speed of 210 FPS, demonstrating a strong balance between accuracy and efficiency. These results confirm that integrating the CBAM attention module into YOLOv8 enhances its feature representation capability, making it more effective for detecting complex sewer pipe defects in real time.
The comparative analysis in Table 7 reveals that attention mechanism selection significantly impacts both performance and computational requirements. While CA (Coordinate Attention) achieves the highest mAP@50 of 0.82, the marginal improvement over CBAM (0.81) comes at the cost of increased computational complexity (4.7 M vs. 4.5 M parameters) and reduced inference speed (200 vs. 210 FPS). ECA demonstrates competitive performance (mAP@50: 0.80) with slightly lower parameters (4.2 M) but reduced detection precision (mAP@50:95: 0.64). SE attention shows the lowest overall performance across all metrics. Based on this comprehensive evaluation, CBAM emerges as the optimal choice for our sewer inspection pipeline, providing the best balance between detection accuracy and real-time processing requirements.
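One way to make the CA-vs-CBAM trade-off concrete is a simple combined score over the Table 7 numbers. The speed weighting below is our own illustrative choice, not a criterion used in the study, which selects CBAM qualitatively.

```python
# Illustrative accuracy/speed trade-off scoring over two Table 7 variants.
# fps_weight is an assumed exchange rate (mAP points per extra FPS);
# with this weighting, CBAM's 10-FPS advantage outweighs CA's 0.01 mAP edge.
def score(map50, fps, fps_weight=0.002):
    return map50 + fps_weight * fps

variants = {"CA": (0.82, 200), "CBAM": (0.81, 210)}   # (mAP@50, FPS)
ranked = sorted(variants, key=lambda k: score(*variants[k]), reverse=True)
print(ranked)
```

Lowering `fps_weight` flips the ranking back toward CA, which is exactly the sensitivity a deployment-driven selection has to account for.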
The comparison with recent YOLO architectures (YOLOv9, YOLOv10) provides important insights into the evolution of object detection frameworks and their applicability to infrastructure inspection tasks. While newer architectures like YOLOv9 demonstrate superior raw detection performance, the substantial increase in computational requirements (5.7× more parameters, 12.1× more FLOPs) raises concerns about practical deployment feasibility, particularly in resource-constrained field environments. Our selection of YOLOv8 as the base architecture, enhanced with CBAM attention mechanisms, represents a deliberate engineering decision that prioritizes deployment practicality without significantly compromising detection accuracy. The gap of 0.05 mAP@50 relative to YOLOv9 (0.81 vs. 0.86, roughly a 6% relative difference) is justified by the substantial computational savings and the real-time processing capability essential for automated sewer inspection systems.
Conclusions and future work
In this study, we proposed a deep learning-based approach to automatically detect and classify sewer pipe defects using real-world inspection images from polyethylene pipes collected across more than 200 CCTV inspections. Our primary contributions include the development of a Hybrid ResNet50–Swin Transformer classifier and a Modified YOLOv8n detector enhanced with CBAM, both of which were thoroughly evaluated against multiple state-of-the-art architectures on a diverse and challenging dataset comprising three common defect types: root intrusions, deposits, and open joints.
The experimental results demonstrate that our proposed classification and detection models consistently outperform traditional CNN-based architectures. Specifically, the Hybrid ResNet50–Swin Transformer achieved the highest classification performance, outperforming ResNet50, InceptionV3, VGG16, MobileNetV2, and even prior defect classification approaches in terms of accuracy, robustness, and parameter efficiency. Similarly, our Modified YOLOv8n, enhanced with CBAM for better attention to critical regions, achieved a mAP@50 of 0.81 and mAP@50:95 of 0.66, along with superior inference speed and low computational cost. Our experimental evaluation against the latest YOLO architectures, including YOLOv9 and YOLOv10, demonstrates that the proposed Modified YOLOv8n + CBAM framework achieves competitive detection performance while maintaining optimal computational efficiency for practical deployment scenarios. These results underscore the model’s capability to support real-time and accurate defect detection in practical field conditions. Crucially, the performance metrics indicate that our system is well-suited for detecting defects in sewer systems under diverse environmental and operational conditions. The models proved effective across multiple pipe sizes and imaging conditions, highlighting their generalization capability and practical relevance for infrastructure monitoring. Our work is particularly significant in the context of Iran’s sewer network, where root intrusions, deposits, and open joints represent the most frequently occurring and structurally threatening defects. By focusing on these specific classes, we provide a targeted and high-performing solution that can enhance the reliability and speed of sewer inspection processes in the region.
Looking ahead, future research will aim to expand the range of detectable defects beyond the current three categories. Sewer systems often experience a wider variety of issues, including cracks, fractures, collapses, infiltration/exfiltration, corrosion, and water leakage, which were not addressed in this study due to the very low frequency of such defects in the dataset. Moreover, certain issues—such as corrosion, cracks, and fractures—are less likely to occur given the material of the sewer pipes in our study, which are made of polyethylene. By collecting and labeling additional datasets representing these conditions, we plan to extend the model’s detection capabilities and build a more comprehensive and scalable diagnostic tool for sewer health monitoring. In addition, integrating the model with robotic inspection platforms and geolocation systems could facilitate end-to-end automation of defect localization, reporting, and prioritization for maintenance crews. The promising results of this study establish a solid foundation for deploying AI-based systems in real-world sewer inspection workflows, contributing to improved urban infrastructure resilience, environmental protection, and public health outcomes.
While this study establishes a solid foundation for AI-based sewer inspection, several methodological and practical constraints should be acknowledged. Our approach is primarily validated on polyethylene pipe infrastructure within Iranian environmental conditions, which may limit generalizability to different pipe materials, regional installation practices, or operational environments. The computational requirements (29.98 M parameters and a 114.5 MB model for classification, 8 GB of GPU memory for training) and dependency on expert annotation may present deployment challenges in resource-constrained settings. Additionally, our imaging platform’s technical specifications (480 TVL resolution, single-point illumination) represent inherent constraints that could affect detection performance under varying field conditions. The dataset scope, while comprehensive with 6,912 images from over 200 inspections, reflects the specific defect distribution patterns of polyethylene systems and may not capture the full spectrum of conditions encountered in diverse geographical or infrastructural contexts. These limitations highlight opportunities for future enhancement through cross-regional validation, lightweight architecture development, and expanded defect taxonomy integration.
Acknowledgements
We would like to extend our sincere gratitude to Kavosh Mechanized Inspection Technology Inc. for their invaluable assistance in gathering the data utilized in this study. Their support and collaboration played a pivotal role in facilitating the acquisition of the necessary datasets, enabling us to conduct thorough analysis and experimentation. We are deeply appreciative of their commitment to advancing research and innovation in the field of sewer defect detection. Additionally, this study used a large language model (LLM) to assist with language refinement, grammar correction, and improving the clarity and flow of the manuscript. All scientific content, data analysis, and conclusions were authored and validated solely by the researchers.
Author contributions
Saleh Goharinezhad: Formal analysis, Writing—original draft, Methodology. Alireza Hadi: Methodology, Project administration, Supervision. Sayeh Mirzaei: Conceptualization, Project administration.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Data availability
The data supporting the findings of this study were provided under license from Tehran Province Water and Wastewater Company and Kavosh Mechanized Inspection Technology Inc. Due to contractual restrictions, these data are not publicly available. However, they may be obtained from the corresponding author upon reasonable request and with permission from the aforementioned providers.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. EPA. Report to Congress on Impacts and Control of Combined Sewer Overflows and Sanitary Sewer Overflows. https://www.epa.gov/sites/default/files/2015-10/documents/csossortc2004_full.pdf.
- 2. EPA. Why Control Sanitary Sewer Overflows. U.S. Environmental Protection Agency. https://www.epa.gov/npdes/sanitary-sewer-overflows-ssos.
- 3. ASCE. Wastewater Infrastructure Report Card. ASCE (accessed May 2, 2019). https://www.infrastructurereportcard.org/wp-content/uploads/2017/01/Wastewater-Final.pdf
- 4. Wong, K. & Allan, R. Hong Kong Conduit Condition Evaluation Codes (Utility Training Institute, 2009).
- 5. Cheng, J. C. & Wang, M. Automated detection of sewer pipe defects in closed-circuit television images using deep learning techniques. Autom. Constr. 95, 155–171 (2018).
- 6. Trenchless Australasia. Open Joints in Sewer Pipelines. https://www.trenchless-australasia.com.
- 7. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 (7553), 436–444 (2015).
- 8. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015).
- 9. Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271 (2017).
- 10. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
- 11. Kumar, S. S., Abraham, D. M., Jahanshahi, M. R., Iseley, T. & Starr, J. Automated defect classification in sewer closed circuit television inspections using deep convolutional neural networks. Autom. Constr. 91, 273–283 (2018).
- 12. Hassan, S. I. et al. Underground sewer pipe condition assessment based on convolutional neural networks. Autom. Constr. 106, 102849 (2019).
- 13. Xie, Q., Li, D., Xu, J., Yu, Z. & Wang, J. Automatic detection and classification of sewer defects via hierarchical deep learning. IEEE Trans. Autom. Sci. Eng. 16 (4), 1836–1847 (2019).
- 14. Kumar, S. S. & Abraham, D. M. A deep learning based automated structural defect detection system for sewer pipelines. In ASCE International Conference on Computing in Civil Engineering 2019, pp. 226–233 (American Society of Civil Engineers, 2019).
- 15. Li, D., Cong, A. & Guo, S. Sewer damage detection from imbalanced CCTV inspection data using deep convolutional neural networks with hierarchical classification. Autom. Constr. 101, 199–208 (2019).
- 16. Wang, Y., Fan, J. & Sun, Y. Classification of sewer pipe defects based on an automatically designed convolutional neural network. Expert Syst. Appl., 125806 (2024).
- 17. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014).
- 18. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015).
- 19. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), 1137–1149 (2016).
- 20. He, K., Zhang, X., Ren, S. & Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37 (9), 1904–1916 (2015).
- 21. Liu, W. et al. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, pp. 21–37 (Springer, 2016).
- 22. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016).
- 23. Wang, M. & Cheng, J. C. Development and improvement of deep learning based automated defect detection for sewer pipe inspection using faster R-CNN. In Advanced Computing Strategies for Engineering: 25th EG-ICE International Workshop, Lausanne, Switzerland, June 10–13, 2018, Proceedings, Part II, pp. 171–192 (Springer, 2018).
- 24. Chen, Y. et al. Deep learning based underground sewer defect classification using a modified RegNet. Comput. Mater. Contin. 75, 5455–5473 (2023).
- 25. Yu, Z., Li, X., Sun, L., Zhu, J. & Lin, J. A composite transformer-based multi-stage defect detection architecture for sewer pipes. Comput. Mater. Contin. 78 (1) (2024).
- 26. Yang, Z., Weng, H., Sun, Y., Chen, X. & Guan, J. Drainage pipeline defect detection based on improved YOLOX. In 3rd International Conference on Cloud Computing, Big Data Application and Software Engineering (CBASE), pp. 457–465 (IEEE, 2024).
- 27. He, P. et al. LSIDA-YOLOv7: An optimized YOLOv7 based on local sensitive information data augmentation for sewer pipeline defect detection. In Proceedings of the 4th International Conference on Computer, Artificial Intelligence and Control Engineering, pp. 914–918 (2025).
- 28. Dong, J. & Liao, M. Defect detection of urban drainage pipeline based on improved YOLO-v8. In 2024 IEEE 7th International Conference on Information Systems and Computer Aided Education (ICISCAE), pp. 284–289 (IEEE, 2024).
- 29. Zhao, C., Hu, C., Shao, H. & Liu, J. PipeMamba: State space model for efficient video-based sewer defect classification. IEEE Trans. Artif. Intell. (2025).
- 30. Hu, C., Zhao, C., Shao, H., Deng, J. & Wang, Y. TMFF: Trustworthy multi-focus fusion framework for multi-label sewer defect classification in sewer inspection videos. IEEE Trans. Circuits Syst. Video Technol. (2024).
- 31. Zhao, C., Hu, C., Shao, H., Dunkin, F. & Wang, Y. Trusted video-based sewer inspection via support clip-based Pareto-optimal evidential network. IEEE Signal Process. Lett. (2024).
- 32. Li, M. et al. PipeTransUNet: CNN and transformer fusion network for semantic segmentation and severity quantification of multiple sewer pipe defects. Appl. Soft Comput. 159, 111673 (2024).
- 33. Pan, G., Zheng, Y., Guo, S. & Lv, Y. Automatic sewer pipe defect semantic segmentation based on improved U-Net. Autom. Constr. 119, 103383 (2020).
- 34. Zhou, Q. et al. Automatic sewer defect detection and severity quantification based on pixel-level semantic segmentation. Tunn. Undergr. Space Technol. 123, 104403 (2022).
- 35. Yong, H., Huang, J., Meng, D., Hua, X. & Zhang, L. Momentum batch normalization for deep learning with small batch size. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII, pp. 224–240 (Springer, 2020).
- 36. He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pp. 630–645 (Springer, 2016).
- 37. Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018).
- 38. Chen, Y., Kalantidis, Y., Li, J., Yan, S. & Feng, J. A²-Nets: Double attention networks. Adv. Neural Inf. Process. Syst. 31 (2018).
- 39. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017).
- 40. Wei, Y., Xu, Z., Dai, C. & Zhang, W. Bidirectional feature pyramid for face detection in dense crowd scenes. MCCSIS, p. 139 (2024).
- 41. Ultralytics. YOLOv8 Docs – Ultralytics (accessed May 2, 2025). https://docs.ultralytics.com/
- 42. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016).
- 43. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L. C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018).
- 44. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016).
- 45. Qassim, H., Verma, A. & Feinzimer, D. Compressed residual-VGG16 CNN model for big data places image recognition. In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), pp. 169–175 (IEEE, 2018).
- 46. He, J., Wang, Z., Yong, Z., Yang, C. & Li, T. An automatic defect detection and localization method using imaging geometric features for sewer pipes. Measurement 243, 116367 (2025).
- 47. Ultralytics. Ultralytics YOLO Docs. https://docs.ultralytics.com/models/yolov9/#what-tasks-and-modes-does-yolov9-support.