Scientific Reports. 2025 Dec 16;16:2907. doi: 10.1038/s41598-025-32817-x

Enhanced YOLOv8 for accurate and efficient floating object detection on water surfaces

YanPeng Cao 1, HaoWen Luo 2, MengDi Wang 3, Yue Wang 1, Hao Yan 1
PMCID: PMC12827348  PMID: 41402541

Abstract

Detecting floating objects on water surfaces is critical for environmental monitoring and pollution control. We present SEDS-YOLOv8, an enhanced YOLOv8n variant that integrates Squeeze-and-Excitation (SE) attention, Distribution Shift Convolution (DSConv), and Enhanced Intersection over Union (EIoU) loss to address challenges in complex aquatic environments. These environments are characterized by surface reflections, ripple-induced noise, and dense small debris that complicate accurate detection. We trained and evaluated our model on a hybrid dataset of 28,000 images with data augmentation for robustness. The SEDSConv module replaces selected convolutional layers with DSConv for efficient multi-scale feature extraction, while SE attention suppresses reflection-induced channel noise by recalibrating feature responses. The EIoU loss accelerates convergence and improves localization accuracy through decoupled width-height regression. SEDS-YOLOv8 achieves 86.02% precision, 85.01% recall, and 88.82% mAP@0.5 with 2.90M parameters and 7.60 GFLOPs (7.3% fewer than the baseline YOLOv8n at 8.20 GFLOPs), while maintaining real-time inference at 103.7 FPS on NVIDIA RTX 4090 hardware. Our contribution is the systematic integration and adaptation of existing techniques to water-surface detection, demonstrating that task-specific architectural choices can substantially improve accuracy without sacrificing computational efficiency. Code and dataset are publicly available.

Keywords: Water surface floater detection, Distribution shifting convolution, Squeeze-and-excitation networks, Lightweight model, YOLOv8, Computer vision

Subject terms: Engineering, Environmental sciences, Mathematics and computing

Introduction

Monitoring floating debris on water surfaces is essential for environmental governance, yet manual inspection is costly and inconsistent. Practical systems must sustain real-time operation under harsh visual conditions (specular reflections, ripple-induced blur, turbidity, and unstable illumination) that degrade channel signal-to-noise ratios and destabilize localization. These factors make water-surface detection systematically harder than generic object detection.

Early rule-based pipelines and hand-crafted features cannot cope with the non-stationary appearance of water scenes and require constant retuning, despite classical components such as non-maximum suppression [1] and spatial pyramid pooling [2]. Two-stage detectors [3,4] deliver strong accuracy, and recent developments in remote-sensing knowledge distillation [5,6] and vision transformers [7,8] further improve representational power. However, their multi-stage pipelines, large models, and quadratic attention costs impede deployment on edge devices and, crucially, they do not directly target the reflection and ripple interference typical of water surfaces.

Single-stage YOLO variants [9–15] emphasize throughput via lightweight modules [16], feature-path aggregation [17], and modern loss functions [18–21]. Attention mechanisms, from simple channel recalibration to external attention and video spatiotemporal fusion [22–25], offer robustness, yet most works address generic scenes or video surveillance rather than the optical artefacts unique to water. In aquatic debris detection, researchers have tailored YOLO-based detectors to rivers and underwater scenes, including improved YOLOv5/YOLOv7 for floating garbage [26,27], plastic-waste monitoring and diffusion-model baselines [28,29], marine object detection with enhanced path aggregation [30], and underwater multi-scale fusion [31], while hierarchical evidence aggregation for water surfaces has also been explored [32]. Despite these advances, existing methods leave a critical gap: unlike video-based spatiotemporal fusion or transformer-based external attention optimized for surveillance, our task requires real-time single-frame inference while explicitly suppressing reflection-dominated channel noise and stabilizing geometry under ripples, conditions unaddressed by prior work.

We close this gap with SEDS-YOLOv8, a task-oriented integration of three proven ideas within the YOLOv8n pipeline: Squeeze-and-Excitation (SE) for channel recalibration against reflections, Distribution Shift Convolution (DSConv) for efficiency, and Enhanced IoU (EIoU) for stable box regression. This combination yields 88.82% mAP@0.5 with 2.90M parameters and 7.60 GFLOPs (7.3% fewer than YOLOv8n) at 103.7 FPS, demonstrating that careful, scene-specific integration, rather than new heavy modules, offers a practical accuracy-efficiency trade-off for water-surface monitoring.

YOLOv8 network overview

YOLOv8 offers five variants (n, s, m, l, x) with increasing complexity and computational cost. We adopt the lightweight YOLOv8n variant to balance accuracy and real-time performance on resource-constrained hardware. YOLOv8n comprises four components: Input, Backbone, Neck, and Head.

Figure 1 shows the standard YOLOv8 architecture with four main components. The Input applies Mosaic augmentation and resizes images to 640×640 pixels. The Backbone extracts hierarchical features through Conv blocks, C2f modules, and SPPF for multi-scale feature aggregation. The Neck performs bidirectional feature fusion via FPN+PAN, generating three feature maps P3, P4, and P5 at different resolutions. The Head uses decoupled branches for classification and bounding box regression. In SEDS-YOLOv8, we modify this baseline by inserting SE attention modules in the neck’s C2f blocks, replacing select Conv/C2f layers with DSConv, and adopting EIoU loss for supervision, while preserving the core YOLOv8 pipeline structure.

Fig. 1.


YOLOv8 network structure.

The Input layer applies Mosaic augmentation (disabled in the final 10 epochs for stability) and resizes images to 640×640. YOLOv8 adopts an anchor-free paradigm, directly predicting box centers and dimensions without predefined anchors, simplifying non-maximum suppression and accelerating inference. The C2f module improves gradient flow through cross-stage connections while maintaining computational efficiency.

The Backbone extracts hierarchical features using CSPDarknet-53 principles. It replaces YOLOv5’s C3 with C2f modules to reduce parameters while preserving representational capacity. Key components include Conv blocks that combine convolution, batch normalization, and SiLU activation; C2f modules for multi-scale feature extraction; and SPPF for spatial pyramid pooling and fixed-size feature aggregation.
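
As a reference point, the Conv block described above (convolution + batch normalization + SiLU) can be sketched in a few lines of PyTorch; the channel widths and strides below are illustrative, not the exact YOLOv8n configuration:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Conv block: convolution + batch normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 640, 640)
y = ConvBlock(3, 16, k=3, s=2)(x)   # stride-2 downsampling halves resolution
print(y.shape)                       # torch.Size([1, 16, 320, 320])
```

Bias is omitted in the convolution because batch normalization supplies its own affine shift.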

The Neck fuses features bidirectionally through FPN+PAN architecture. FPN performs top-down upsampling to enrich high-resolution features with semantic information, while PAN conducts bottom-up downsampling to enhance low-resolution features with spatial detail. This produces three feature maps P3, P4, and P5 at different scales for multi-scale detection.
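
A minimal sketch of this bidirectional fusion, with plain 1×1 convolutions standing in for YOLOv8's C2f fusion blocks and illustrative channel widths:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPNPAN(nn.Module):
    """Top-down (FPN) then bottom-up (PAN) fusion over three scales."""
    def __init__(self, c3=64, c4=128, c5=256):
        super().__init__()
        self.fuse4_td = nn.Conv2d(c4 + c5, c4, 1)   # top-down: up(C5) + C4
        self.fuse3_td = nn.Conv2d(c3 + c4, c3, 1)   # top-down: up(P4) + C3
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)
        self.fuse4_bu = nn.Conv2d(c3 + c4, c4, 1)   # bottom-up: down(P3) + P4
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.fuse5_bu = nn.Conv2d(c4 + c5, c5, 1)   # bottom-up: down(P4) + C5

    def forward(self, c3, c4, c5):
        p4 = self.fuse4_td(torch.cat([F.interpolate(c5, scale_factor=2), c4], 1))
        p3 = self.fuse3_td(torch.cat([F.interpolate(p4, scale_factor=2), c3], 1))
        p4 = self.fuse4_bu(torch.cat([self.down3(p3), p4], 1))
        p5 = self.fuse5_bu(torch.cat([self.down4(p4), c5], 1))
        return p3, p4, p5

c3, c4, c5 = (torch.randn(1, 64, 80, 80),
              torch.randn(1, 128, 40, 40),
              torch.randn(1, 256, 20, 20))
p3, p4, p5 = TinyFPNPAN()(c3, c4, c5)
print(p3.shape, p4.shape, p5.shape)   # 80x80, 40x40, 20x20 detection maps
```

The three returned maps correspond to P3/P4/P5 feeding the detection head.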

The Head uses decoupled branches for classification and regression. The classification branch predicts object categories, while the regression branch predicts bounding boxes using task-aligned losses to jointly optimize confidence and IoU. Dynamic sample assignment ensures efficient training across scales.

YOLOv8 optimizations and improvements

SEDSConv Net network improvement method

SEDS-YOLOv8 addresses water-surface detection challenges through three integrated modifications to the YOLOv8n baseline: SE attention modules for channel recalibration to suppress reflection noise; DSConv for computational efficiency without accuracy loss; and EIoU loss for stable geometric regression under localization drift. These task-specific adaptations target environmental interference from reflections and ripples, along with computational constraints inherent to real-time monitoring.

SE attention mechanism

We employ Squeeze-and-Excitation blocks to suppress reflection-induced channel noise. SE acts on each feature map by squeezing spatial information to a compact channel descriptor, then exciting channels with data-driven weights, followed by channel-wise rescaling. Inserted into the neck’s C2f blocks, SE adds few parameters (governed by the reduction ratio r), yet reliably enhances robustness to specular highlights.

Squeeze (global context)

For an input U \in \mathbb{R}^{H \times W \times C}, Global Average Pooling (GAP) [33] produces a channel descriptor z \in \mathbb{R}^{C}:

z_c = \frac{1}{H W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)    (1)

Excitation (channel dependencies)

A light two-layer bottleneck predicts channel weights s:

s = \sigma\left(W_2\, \delta(W_1 z)\right)    (2)

with W_1 \in \mathbb{R}^{(C/r) \times C} and W_2 \in \mathbb{R}^{C \times (C/r)}, where \delta denotes the ReLU activation, \sigma the sigmoid function, and r the reduction ratio.

Scale (recalibration)

Channels are reweighted as \tilde{x}_c = s_c \cdot u_c, amplifying informative responses and attenuating noisy ones. Figure 2 illustrates the complete SE block architecture.
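
The three steps above (squeeze, excite, scale) map directly to a small PyTorch module; the reduction ratio r = 16 is the common default from the SE paper, assumed here since the text does not fix a value:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (GAP) -> excite (FC-ReLU-FC + sigmoid) -> channel-wise scale."""
    def __init__(self, channels, r=16):   # r: reduction ratio (assumed default)
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                 # Eq. (1): squeeze
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # Eq. (2): excite
        return x * s.view(b, c, 1, 1)                          # recalibration

x = torch.randn(2, 64, 40, 40)
y = SEBlock(64)(x)
print(y.shape)   # same shape as x; each channel rescaled by a weight in (0, 1)
```

Because each sigmoid weight lies in (0, 1), the block can only attenuate channels, which is what suppresses reflection-dominated responses.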

Fig. 2.


SE block: GAP squeeze → FC-ReLU-FC excitation → sigmoid weights → channel-wise scaling. We place SE in the neck’s C2f blocks to mitigate reflection noise.

The resulting SE-enabled neck is shown in Fig. 3.

Fig. 3.


SE-YOLOv8 Network Architecture with SE inserted into three C2f blocks before the Detect head.

DSConv

Distribution Shift Convolution (DSConv) [34] reduces computational requirements and memory consumption through quantization and distribution shift mechanisms. The layer design decomposes conventional floating-point convolution kernels into two main components: the variable quantization kernel (VQK) and the distribution shifter. VQK is an integer tensor with the same structure as the native convolution kernel, having dimensions (cho, chi, k, k), where cho represents the number of output channels, chi is the number of input channels, and k represents the convolution kernel size. In our training-from-scratch setting, VQK parameters are initialized by quantizing floating-point weights drawn from standard neural initialization and are optimized jointly through the learned distribution shifters, enabling fast and efficient integer arithmetic while reducing computational complexity.

DSConv employs VQK to store quantized integer weights, reducing storage and accelerating operations. The distribution shifter comprises kernel distribution shift (KDS) and channel distribution shift (CDS). KDS applies per-block distribution offsets to VQK sub-blocks (dimensions: 1 × BLK × 1 × 1, where BLK is the block size hyperparameter), calibrated via a single-precision tensor of size 2 × ceil(chi/BLK) × cho × k × k to match the original floating-point weight distributions. CDS operates on the channel dimension, generating a 2 × cho tensor to adjust channel-level distributions, enabling output feature maps to approximate traditional floating-point convolution results.

The quantization process of DSConv converts network weights into integer representations of fixed bit-width to reduce storage and computation costs. For a quantization bit-width b, the quantized weight \hat{w} is represented using signed integers in 2’s complement form:

\hat{w} \in \{-2^{b-1}, -2^{b-1}+1, \ldots, 2^{b-1}-1\}    (3)

where \hat{w} represents the quantized integer value of each weight parameter, and b denotes the number of quantization bits. The quantization procedure scales the weights of each convolutional layer such that the maximum absolute value of the original floating-point weights aligns with the maximum representable value defined by the quantization constraint. All weight parameters are then rounded to the nearest integer and stored as integers in memory for subsequent training and inference. This per-block quantization process enables flexible architectural design, where the shift factor selects the optimal range of integer values for weight representation.
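
A sketch of this scale-then-round quantization under the stated rule (the maximum |w| maps onto the largest representable b-bit magnitude); the example weights are illustrative:

```python
import numpy as np

def quantize_weights(w, b=4):
    """Scale so max |w| maps to the largest representable b-bit magnitude,
    then round each weight to the nearest integer (the VQK values)."""
    qmax = 2 ** (b - 1) - 1                  # b=4 -> signed range [-8, 7]
    scale = np.abs(w).max() / qmax
    q = np.rint(w / scale).astype(np.int8)
    return q, scale                          # w is approximated by q * scale

w = np.array([0.31, -0.80, 0.05, -0.14])
q, scale = quantize_weights(w, b=4)
print(q)           # [ 3 -7  0 -1]
print(q * scale)   # dequantized approximation of w
```

Only the integer tensor q is stored per block; the scale is what the distribution shifter later learns and refines.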

The purpose of distribution shift is to bring the quantized model’s output as close as possible to that of the original floating-point model by adjusting the quantized weights (VQK). This process involves two main components: kernel distribution shift (KDS) and channel distribution shift (CDS). We denote \xi as the KDS scaling factor and \beta as its offset term, with analogous multiplier and bias parameters for CDS.

To ensure that model performance does not substantially degrade after quantization while maintaining optimization speed and efficiency, two methods are proposed for initializing the scaling factors \xi: minimizing the KL divergence and minimizing the L2 norm. The former ensures that the VQK maintains distribution characteristics similar to the original weights after the KDS transformation; the latter finds the scaling factor that minimizes the Euclidean distance between the quantized and original tensors.
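
For the L2 variant, the per-block scaling factor has a closed form: minimizing ||s·Q − W||₂ over a scalar s gives s = ⟨Q, W⟩ / ⟨Q, Q⟩. A sketch with an illustrative block:

```python
import numpy as np

def l2_scale(Q, W):
    """Closed-form minimizer of ||s*Q - W||_2 over the scalar s."""
    Qf = Q.astype(np.float64).ravel()
    return float(np.dot(Qf, W.ravel()) / np.dot(Qf, Qf))

W = np.array([0.30, -0.72, 0.11, -0.15])   # original float block
Q = np.array([3, -7, 1, -1])               # its quantized VQK values
s = l2_scale(Q, W)
print(s, np.linalg.norm(s * Q - W))        # per-block scale and residual
```

Nudging s in either direction can only increase the residual, which is the defining property of the L2 initialization.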

For channel shifters, initial values are set to 1 for the multiplier and 0 for the bias, because the initial quantization step applies scaling only to VQK and does not involve channel-level adjustments. This configuration enables the channel shifter to serve as a fine-tuning mechanism during subsequent training, primarily to restore or improve model accuracy by ensuring that activation patterns match the original convolution operation results as closely as possible.

DSConv reduces memory footprint and increases processing speed through faster integer operations. In SEDS-YOLOv8, we replace select Conv and C2f layers in the backbone and neck with DSConv2D and C2F_DSConv, integrating SE attention modules after C2F_DSConv blocks. This design addresses computational redundancy while maintaining representational capacity, as shown in Fig. 4.

Fig. 4.


SEDSConv network structure.

EIoU loss function

We adopt the Enhanced Intersection over Union (EIoU) loss function to address bounding box regression challenges in aquatic datasets, replacing CIoU loss. EIoU decouples width and height penalties, applying independent constraints on aspect ratios and centroid alignment to mitigate positional uncertainties caused by ripple-induced localization drift. This formulation stabilizes bounding box regression under water-surface conditions.

YOLOv8 uses a combination of DFL loss and CIoU loss. CIoU incorporates three components: overlap area, centroid distance, and aspect ratio consistency. The CIoU loss, defined in Eq. (4), introduces an aspect ratio consistency term through the balancing factor \alpha and the aspect ratio similarity measure v.

L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v    (4)

where b and b^{gt} denote the center points of the predicted and ground truth bounding boxes, \rho(\cdot) represents the Euclidean distance, c is the diagonal length of the smallest enclosing box covering both boxes, IoU is the Intersection over Union, and the aspect ratio similarity v and balancing factor \alpha are computed as:

v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2    (5)

\alpha = \frac{v}{(1 - IoU) + v}    (6)

where w and h are the width and height of the predicted box, and w^{gt} and h^{gt} are the width and height of the ground truth box.

CIoU extends traditional IoU by addressing non-overlapping cases through gradient-aware optimization, mitigating the vanishing gradient problem. However, CIoU’s weighting factor \alpha depends on both the aspect ratio similarity v and the IoU, creating imbalanced optimization: when IoU is small, \alpha diminishes, de-emphasizing aspect ratio consistency precisely where geometric correction matters most. Additionally, CIoU’s coupled aspect ratio treatment inadequately captures the independent width/height contributions to localization errors, leading to suboptimal convergence for water-surface detection, where ripple-induced drift affects width and height independently.

To address these limitations, we adopt the Enhanced IoU (EIoU) loss [21], which decouples width and height penalties to improve geometric regression stability.

The EIoU loss introduces two key innovations:

First, EIoU decomposes the coupled aspect ratio influence factor of CIoU into independent width and height penalty terms, measuring discrepancies between the predicted and ground truth boxes:

L_{asp} = \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}    (7)

where w and h denote the width and height of the predicted box, w^{gt} and h^{gt} are the width and height of the ground truth box, C_w and C_h represent the width and height of the minimum enclosing box covering both boxes, and \rho(\cdot) denotes the Euclidean distance.

Secondly, the width and height penalty terms are normalized by the enclosing box dimensions C_w and C_h, a scale-aware mechanism that makes loss values comparable across objects of different sizes. The center point distance penalty retains its normalization, with its relative weight reduced through the dimensional decoupling.

The complete EIoU loss function is formulated as:

L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}    (8)

where L_{IoU} = 1 - IoU represents the IoU loss component, L_{dis} denotes the normalized distance loss between the predicted and ground truth box centers, and L_{asp} captures the decoupled aspect ratio loss through independent width and height penalties. This approach retains the effective properties of CIoU loss while directly reducing width and height discrepancies between predicted and ground truth boxes, leading to faster convergence and improved localization accuracy. A visual comparison of the CIoU and EIoU loss formulations is illustrated in Fig. 5.
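
Equation (8) translates directly into code. The sketch below is our reading of the formulation for corner-format boxes, not the authors' released implementation; `eps` guards against division by zero:

```python
import torch

def eiou_loss(pred, gt, eps=1e-7):
    """L_EIoU = (1 - IoU) + rho^2(b, b_gt)/c^2 + rho^2(w, w_gt)/C_w^2 + rho^2(h, h_gt)/C_h^2.
    Boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    iw = (torch.min(pred[:, 2], gt[:, 2]) - torch.max(pred[:, 0], gt[:, 0])).clamp(min=0)
    ih = (torch.min(pred[:, 3], gt[:, 3]) - torch.max(pred[:, 1], gt[:, 1])).clamp(min=0)
    inter = iw * ih
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])  # enclosing box
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps                                            # squared diagonal

    dx = (pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) / 2                # center offsets
    dy = (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) / 2
    dist = (dx ** 2 + dy ** 2) / c2

    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]               # decoupled w/h
    wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    asp = (wp - wg) ** 2 / (cw ** 2 + eps) + (hp - hg) ** 2 / (ch ** 2 + eps)

    return 1 - iou + dist + asp

pred = torch.tensor([[0., 0., 4., 4.]])
gt = torch.tensor([[1., 1., 5., 5.]])
print(eiou_loss(pred, gt))   # positive; approaches 0 only for a perfect match
```

Unlike CIoU, the width and height errors here contribute separate gradients, so a ripple-shifted edge corrects one dimension without disturbing the other.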

Fig. 5.


CIoU and EIoU knowledge map.

Experimental setup and results

Dataset and experimental configuration

We constructed a 28,000-image dataset by augmenting a 5,000-image raw corpus. The raw set comprises 3,000 images from ORCA-UBOAT [35] (60%) and 2,000 web-scraped river/lake scenes (40%). Images cover diverse weather, lighting, and water-surface states typical of urban rivers and lakes, captured with mobile devices. The dataset contains eight object categories: plastics, organic debris, styrofoam, metal cans, glass, paper, textiles, and other floating debris. Representative samples appear in Fig. 6. Images were annotated using a two-stage protocol with an inter-annotator agreement of Cohen’s kappa = 0.87. We split the data into train/val/test sets at a 70/20/10 ratio. The 28,000 images are obtained by applying the stochastic augmentation policy in Table 1 to the 5,000-image raw set; the same policy is used during training.

Fig. 6.


Representative samples from the hybrid dataset across objects (plastics, organic debris, styrofoam, metal cans) and conditions (weather, lighting, and water states), illustrating real-world diversity.

Table 1.

Data augmentation operations and key settings.

Operation Settings
Photometric Brightness ±20%, contrast ±15%, saturation ±10%
Blur/Noise Gaussian noise; salt–pepper noise
Geometry Rotate ±15°, flip, random crop
Resize 640×640
Mosaic On; disabled in last 10 epochs
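
The Table 1 policy (minus rotation, cropping, and Mosaic, which need box-aware handling) can be sketched as a NumPy function on a normalized H×W×3 image; the salt-pepper rate of 0.01 is an assumed value not given in the table:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Photometric jitter, flip, and salt-pepper noise on an HxWx3 image in [0, 1]."""
    img = img * rng.uniform(0.8, 1.2)                        # brightness +/-20%
    mean = img.mean()
    img = (img - mean) * rng.uniform(0.85, 1.15) + mean      # contrast +/-15%
    gray = img.mean(axis=2, keepdims=True)
    img = gray + (img - gray) * rng.uniform(0.9, 1.1)        # saturation +/-10%
    if rng.random() < 0.5:
        img = img[:, ::-1].copy()                            # horizontal flip
    mask = rng.random(img.shape[:2]) < 0.01                  # salt-pepper (rate assumed)
    img[mask] = rng.integers(0, 2, int(mask.sum()))[:, None]  # 0 = pepper, 1 = salt
    return np.clip(img, 0, 1)

img = rng.random((64, 64, 3))
out = augment(img)
print(out.shape, float(out.min()), float(out.max()))
```

Geometric operations and Mosaic must additionally transform the bounding box labels, which is why detection frameworks apply them inside the data loader rather than on raw pixels alone.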

Experiments were conducted on Ubuntu 22.04 with NVIDIA RTX 4090 GPU (24GB), Python 3.8, PyTorch 2.0.0, CUDA 11.8, and cuDNN 8.9. All models were trained from scratch without pretrained weights under identical configurations (Table 2). We evaluate performance using precision, recall, and mean average precision (mAP), with detections validated at IoU threshold 0.5. Ablation studies use YOLOv8n (3.01M parameters, 8.20 GFLOPs) as the consistent baseline.

P = \frac{TP}{TP + FP}    (9)

R = \frac{TP}{TP + FN}    (10)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
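
The two metrics follow directly from these counts; a minimal computation with illustrative numbers:

```python
def precision_recall(tp, fp, fn):
    """Eq. (9) and Eq. (10): P = TP/(TP+FP), R = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=90, fp=10, fn=20)
print(f"P = {p:.2%}, R = {r:.2%}")   # P = 90.00%, R = 81.82%
```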

Table 2.

Experimental configuration parameters.

Settings Parameters Settings Parameters
Epochs 300 Early stopping (patience) 10
Image size 640 Warmup epochs 5
Batch size 32 Cache disk
Workers 16 Learning rate 0.001
Weight decay 0.0005 Optimizer AdamW

Experimental results

We evaluate SEDS-YOLOv8 through comparative experiments against five YOLO variants and systematic ablation studies. Training curves comparing mAP@0.5 and mAP@0.5:0.95 across all models throughout 300 epochs are shown in Fig. 7, demonstrating SEDS-YOLOv8’s consistent performance advantage and stable convergence.

Fig. 7.


Training curves of mAP@0.5 (left) and mAP@0.5:0.95 (right) over 300 epochs.

Qualitative analysis

Figure 8 shows tight localization and low false-alarm rates across diverse scenes. SE suppresses reflection noise, improving debris vs. highlight discrimination. EIoU aligns aspect ratios better on elongated objects under ripples. DSConv preserves spatial detail without degradation. Remaining errors occur on near-transparent films, cluttered foam/branch backgrounds, and extreme low-light scenes.

Fig. 8.


Qualitative detection results on representative test images.

Table 3 demonstrates SEDS-YOLOv8’s superior performance across all metrics. Compared to YOLOv9-gelan-c (84.70% mAP@0.5, 25.42M parameters, 102.5 GFLOPs), SEDS-YOLOv8 achieves +4.12 pp higher mAP@0.5 with only 11.4% of parameters and 7.4% of FLOPs, while maintaining 103.7 FPS. Against lightweight variants (YOLOv5n, YOLOv10n, YOLOv11n), improvements range from 9.91 to 15.71 pp in mAP@0.5. The performance gap with YOLOv10n/YOLOv11n reflects our training-from-scratch protocol without pretrained weights.

Table 3.

Comparison across YOLO variants under identical training settings (YOLOv8n as baseline).

Model P (%) R (%) mAP@0.5 (%) mAP@0.5:0.95 (%) Params (M) FLOPs (G) FPS
YOLOv8n (baseline) 82.16 77.46 80.91 49.28 1.77 4.3 312
YOLOv5n 81.16 75.46 78.91 47.28 1.77 4.3 312
YOLOv7-tiny 81 77.55 80.14 48.74 6.02 13.1 333.3
YOLOv9 (gelan-c) 84.05 81.23 84.7 60.7 25.42 102.5 99
YOLOv10n 77.94 69.6 73.11 45.38 2.71 8.4 142.8
YOLOv11n 80.15 72.39 75.01 47.79 2.59 6.4 148.4
SEDS-YOLOv8n 86.02 85.01 88.82 62.09 2.9 7.6 103.7

Ablations (Table 4) use YOLOv8n as the baseline (mAP@0.5:0.95: 49.48%). SE adds +1.20 pp but roughly halves FPS (180.6 → 95.1). DSConv adds +0.80 pp with 3.6% fewer parameters and 8.5% fewer FLOPs. EIoU contributes most (+10.91 pp, to 60.39%). Combined, SE+EIoU reaches 61.67%; the full model (SE+DS+EIoU) attains 62.09% mAP@0.5:0.95 and 88.82% mAP@0.5 at 103.7 FPS. Gains are mainly cumulative rather than synergistic, while DSConv offsets SE’s speed cost.

Table 4.

Ablation study results with YOLOv8n baseline.

Variant SE DS EIoU P (%) R (%) mAP@0.5 (%) mAP@0.5:0.95 (%) Params (M) FLOPs (G) FPS
Baseline (YOLOv8n) 0 0 0 80.5 72.97 77.45 49.48 3.01 8.2 180.6
SE only 1 0 0 81.35 74.82 78.63 50.68 3.02 8.3 95.1
DSConv only 0 1 0 80.95 73.54 78.21 50.28 2.89 7.5 185.3
EIoU only 0 0 1 85.26 83.89 87.89 60.39 3.15 8.9 182.4
SE + EIoU 1 0 1 86.45 85.23 88.63 61.67 3.16 8.9 96.3
SEDS (Full) 1 1 1 86.02 85.01 88.82 62.09 2.9 7.6 103.7

SE squeeze-and-excitation attention (“SE attention mechanism”), DS distribution shift convolution, EIoU enhanced IoU loss (“EIoU loss function”).

Conclusion

We presented SEDS-YOLOv8, a task-focused refinement of YOLOv8n for water-surface debris detection. By combining SE, DSConv, and EIoU, the model achieves 88.82% mAP@0.5 with 2.90M parameters and 7.60 GFLOPs (7.3% fewer than YOLOv8n), while running at 103.7 FPS. The results surpass recent YOLO variants and illustrate the value of scene-tailored design.

Our ablations clarify the role of each component: SE recalibrates channels to suppress reflections, DSConv improves efficiency, and EIoU stabilizes box regression. Gains are largely additive rather than synergistic. The +12.61 pp increase in mAP@0.5:0.95 confirms the effectiveness of targeted adaptation over generic designs.

We also note several limitations: (i) environmental constraints: performance drops under extreme highlights, heavy waves, or severe turbidity where reflections dominate; (ii) generalization: cross-device and cross-viewpoint robustness requires broader validation across cameras, viewpoints, and lighting; (iii) small objects: dense clusters of tiny debris remain difficult in adverse weather; (iv) training assumptions: the model assumes lighting and surface characteristics consistent with the training distribution, limiting transfer under large shifts; (v) annotation quality: accuracy depends on labels, and ambiguous cases (e.g., partially submerged objects) can affect reliability.

SEDS-YOLOv8 is well suited to real-time pollution monitoring, automated collection and water-quality assessment. Deployment can benefit from polarization or multi-modal sensing to mitigate extreme reflections, and from domain adaptation and knowledge distillation on edge devices. Future work will target stronger robustness in adverse conditions, wider generalization, and better small-object detection. The demonstrated accuracy–efficiency balance supports practical field use.

Acknowledgements

We thank the anonymous reviewers for their valuable comments and suggestions, which helped improve the quality of this paper. This work was supported by two Tianjin Municipal College Students Innovation and Entrepreneurship Projects (Nos. 202410069138 and 202510069084).

Author contributions

YanPeng Cao: Writing–original draft, Software, Methodology, Investigation. HaoWen Luo: Writing–original draft, Materials, Software, Investigation. MengDi Wang: Writing–review & editing, Supervision. Yue Wang: Writing–review & editing, Supervision. Hao Yan: Writing–review & editing, Supervision.

Funding

This work was supported by two Tianjin Municipal College Students Innovation and Entrepreneurship Projects (Nos. 202410069138 and 202510069084).

Data availability

The code and dataset are available in a public GitHub repository: https://github.com/hackerjackL/SEDS-YOLOv8.git.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

YanPeng Cao, Email: cyp1983242@163.com.

HaoWen Luo, Email: 1832259@aliyun.com.

MengDi Wang, Email: 746516470@qq.com.

References

1. Neubeck, A. & Gool, L. J. Efficient non-maximum suppression. In Proceedings of the International Conference on Pattern Recognition, 850–855. 10.1109/ICPR.2006.479 (2006).
2. He, K., Zhang, X., Ren, S. & Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1904–1916. 10.1109/TPAMI.2015.2389824 (2014).
3. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587. 10.1109/CVPR.2014.81 (2014).
4. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149. 10.1109/TPAMI.2016.2577031 (2017).
5. Song, H. et al. Symmetrical learning and transferring: Efficient knowledge distillation for remote sensing image classification. Symmetry 17(7), 1002. 10.3390/sym17071002 (2025).
6. Song, H. et al. CMKD-Net: A cross-modal knowledge distillation method for remote sensing image classification. Adv. Space Res. 10.1016/j.asr.2025.01.015 (2025).
7. Song, H. et al. Optimized data distribution learning for enhancing vision transformer-based object detection in remote sensing images. Photogramm. Rec. 40(189), 70004. 10.1111/phor.12345 (2025).
8. Song, H., Yuan, Y., Ouyang, Z., Yang, Y. & Xiang, H. Quantitative regularization in robust vision transformer for remote sensing image classification. Photogramm. Rec. 39(186), 340–372. 10.1111/phor.12322 (2024).
9. Redmon, J. & Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6517–6525. 10.1109/CVPR.2017.690 (2017).
10. Redmon, J. & Farhadi, A. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
11. Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020).
12. Wang, C. Y., Bochkovskiy, A. & Liao, H. Y. M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022).
13. Wang, C. Y., Yeh, I. H. & Liao, H. Y. M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv preprint arXiv:2502.00967 (2025).
14. Wang, A. et al. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 37, 107984–108011. 10.5555/3454287.3455398 (2024).
15. Tian, Z., Shen, C., Chen, H. & He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, 9627–9636. 10.1109/ICCV.2019.00488 (2019).
16. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. & Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
17. Liu, S., Qi, L., Qin, H., Shi, J. & Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6829–6838. 10.1109/CVPR.2018.00713 (2018).
18. Yu, J., Jiang, Y., Wang, Z., Cao, Z. & Huang, T. UnitBox: An advanced object detection network. In Proceedings of the ACM Multimedia Conference, 516–520. 10.1145/2964284.2967274 (2016).
19. Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R. & Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. arXiv preprint arXiv:1911.08287 (2019).
20. Li, X., Wang, W., Wu, L., Chen, S., Hu, X., Li, J., Tang, J. & Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388 (2020).
21. Zhang, Y. F. et al. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 506, 146–157. 10.1016/j.neucom.2022.07.038 (2022).
22. Ul Amin, S., Kim, B., Jung, Y., Seo, S. & Park, S. Video anomaly detection utilizing efficient spatiotemporal feature fusion with 3D convolutions and long short-term memory modules. Adv. Intell. Syst. 6(7), 2300706. 10.1002/aisy.202300706 (2024).
23. Ul Amin, S., Kim, Y., Sami, I., Park, S. & Seo, S. An efficient attention-based strategy for anomaly detection in surveillance video. Comput. Syst. Sci. Eng. 46(3). 10.32604/csse.2023.034805 (2023).
24. Ul Amin, S., Sibtain Abbas, M., Kim, B., Jung, Y. & Seo, S. Enhanced anomaly detection in pandemic surveillance videos: An attention approach with EfficientNet-B0 and CBAM integration. IEEE Access 12, 162697–162712. 10.1109/ACCESS.2024.3488797 (2024).
25. Amin, S. U., Jung, Y., Fayaz, M., Kim, B. & Seo, S. Enhancing pine wilt disease detection with synthetic data and external attention-based transformers. Eng. Appl. Artif. Intell. 159, 111655. 10.1016/j.engappai.2025.111655 (2025).
26. Yang, X. et al. Detection of river floating garbage based on improved YOLOv5. Mathematics 10(22), 4366. 10.3390/math10224366 (2022).
27. Li, D., Liu, H., Yang, Y., Fan, Z. & Wu, B. Research on small target detection technology for river floating garbage based on improved YOLOv7. In Proceedings of the 2023 International Conference on Intelligent Sensing and Industrial Automation, 30. 10.1145/3632314.3632349 (2023).
28. Sio, A. G., Guantero, D. & Villaverde, J. Plastic waste detection on rivers using YOLOv5 algorithm. In Proceedings of the 2022 13th International Conference on Computing Communication and Networking Technologies, 1–6. 10.1109/ICCCNT54827.2022.9984439 (2022).
29. Pang, C. & Cheng, Y. Detection of river floating waste based on decoupled diffusion model. In Proceedings of the 2023 8th International Conference on Automation, Control and Robotics Engineering, 57–61. 10.1109/CACRE58689.2023.10208741 (2023).
30. Yu, H., Li, X. & Han, S. Y. Multiple attentional path aggregation network for marine object detection. Appl. Intell. 53(2), 2434–2451. 10.1007/s10489-022-04305-4 (2023).
31. Yang, C., Zhang, C., Jiang, L. & Zhang, X. Underwater image object detection based on multi-scale feature fusion. Mach. Vis. Appl. 35(6), 124. 10.1007/s00138-024-01606-3 (2024).
32. Zhong, W. et al. Hierarchical evidence aggregation in two dimensions for active water surface object detection. Vis. Comput. 41(7), 1–16. 10.1007/s00371-022-02505-2 (2022).
33. Lin, M., Chen, Q. & Yan, S. Network in network. arXiv preprint arXiv:1312.4400 (2013).
34. Nascimento, M. G. D., Prisacariu, V. & Fawcett, R. DSConv: Efficient convolution operator. In Proceedings of the IEEE International Conference on Computer Vision, 5734–5743. 10.1109/ICCV.2019.00634 (2019).
35. Cheng, Y., Jiang, M., Zhu, J. & Liu, Y. Are we ready for unmanned surface vehicles in inland waterways? The USVInland multisensor dataset and benchmark. IEEE Robot. Autom. Lett. 6(2), 3964–3970. 10.1109/LRA.2021.3067250 (2021).


