Scientific Reports. 2025 Nov 25;15:45341. doi: 10.1038/s41598-025-29718-4

A fusion approach of YOLOv8 and CNN-Transformer for End-to-End road anomaly detection

Sarfaraz Abdul Sattar Natha 1, Mohammad Siraj 2, Saif A Alsaif 2, Fahad Farooq 1, Admali Shah 2, Maqsood Mahmud 3
PMCID: PMC12748596  PMID: 41291125

Abstract

Surveillance cameras are common in both the private and public sectors, and closed-circuit television (CCTV) systems generate large amounts of video data that cannot be manually monitored 24/7. Traditional manual analysis is time-consuming and inefficient, so there is a growing need for automated surveillance systems that can recognize and classify anomalies. Anomaly detection (AD), which identifies data that deviates from normal patterns, remains one of the most challenging research areas. RNNs are slow and have difficulty identifying road anomalies that span multiple frames, whereas CNNs are limited in extracting temporal features from objects and generally disregard background noise in video frames. In this study, a new framework is presented that removes irrelevant background elements during object recognition. The framework preserves temporal and spatial information across frames and combines YOLOv8 with a spatial-temporal adaptive fusion method in an end-to-end model based on a CNN encoder and a Transformer decoder for parallel video analysis. The proposed method was tested on the UCF Crime dataset and a custom Road Anomaly Dataset (RAD), achieving 89.90% accuracy on the UCF Crime dataset and 98.28% on the RAD dataset.

Keywords: Smart transportation system, CNN, Deep learning, Transfer learning, YOLO, Road anomaly detection

Subject terms: Engineering, Mathematics and computing

Introduction

Security has become an important concern in several areas where crime rates are rising. Surveillance devices create a significant amount of surveillance imagery, yet they do not provide people with a complete security guarantee1. CCTV surveillance has been deployed globally to maintain public safety, help law enforcement, increase personal security, solve crimes, identify traffic violations, and manage traffic2, aided by advances in networks, storage, and processing systems. Given the massive amounts of real-time video data, manual analysis by humans is impractical and time-consuming; automated technologies that quickly and effectively analyze video data are therefore required for CCTV systems to be effective. Abnormal events include road accidents, explosions, theft, physical fights, and robbery3. These are the most difficult to identify in surveillance videos. Anomaly detection (AD), which identifies anomalies in data7, has become one of the most challenging areas of research into identifying data that deviates from standard patterns4. Various sources can be used to analyze and interpret the context of abnormal events5,6. In smart cities, AD systems enhance security by detecting abnormal activity and associating the event with its location. AD has a wide range of applications in fraud detection8, fault diagnosis9, human activity recognition10, network sensor monitoring11, and medical imaging12. However, applying AD in real-world environments is still challenging13, particularly in smart cities, where dynamic and complex conditions exacerbate the problem14. AD systems are important for situational awareness, but the lack of annotated data continues to limit accurate anomaly detection15,16. Another important challenge is data complexity: the factors contributing to anomalous events make it difficult to detect abnormal patterns in an intelligent environment17,18.
The imbalanced nature of AD problems also poses a unique assessment challenge, and selecting appropriate evaluation metrics to address the imbalance is crucial19. The aim of this study is to accurately localize anomalous events in both space and time within a sequence of frames. Determining a global boundary or model that captures all normal behaviors is difficult20. The most challenging anomalies to identify are intelligent anomalies, or adversarial samples, which are designed to appear like normal patterns. Here, we present an advanced framework that integrates deep learning with background subtraction to detect anomalies on roads. Background subtraction produces bounding box masks, which isolate moving objects from the scene, and a CNN encoder captures spatial information (e.g., object size and location) to model how these objects behave.

  • The suggested framework uses YOLOv8s to automatically eliminate road backgrounds, reducing interference and increasing the precision of object detection.

  • This fusion approach uses a CNN encoder and a Transformer decoder to extract spatiotemporal dependencies between objects to precisely detect road anomalies, as opposed to earlier methods that aggregate or concatenate spatiotemporal features. This technique takes video sequences at various time intervals and extracts discriminative information.

  • This framework also allows for the parallel processing of consecutive video frames to scale large and complex datasets and enable real-time detection across multiple frames.

  • The experimental evaluation demonstrates that this framework is reliable and performs better than earlier approaches, achieving 98.28% accuracy on the RAD dataset and 89.90% accuracy on the UCF Crime dataset.

This paper is organized as follows: Sect. 2 provides a summary of current developments in road anomaly detection, Sect. 3 describes the suggested framework, and Sect. 4 presents the experimental results obtained from the model. Section 5 concludes the study and outlines potential directions for future research.

Related work

In this section, we provide an overview of key concepts and techniques in the latest methods for anomaly detection, specifically as applied to road anomalies.

Machine learning approach

Traditional methods for anomaly detection typically involve machine learning. SVM and K-Nearest Neighbor (KNN) algorithms have been used to categorize driver behavior at traffic signals, specifically examining safe versus unsafe stopping responses at the yellow light in dilemma zones throughout rural, suburban, and urban areas21. KNN and Linear Discriminant Analysis achieve accuracies of 90.1% and 89.4%, respectively. However, the cubic kernel underperforms, and the Gaussian kernel’s high computational demand may hinder real-time use. In another study22, the authors proposed a model that uses a Random Forest classifier to predict accidents by differentiating between accidents and non-accidents, drawing on historical weather and accident data gathered from the road network. In addition to a remarkable specificity of 0.97 and a low false positive rate of just 3%, this technique achieved an accuracy of 73%. However, the model’s very low sensitivity of 0.08 suggests that it would miss a sizable portion of possible accidents.

K-Means clustering has been applied to cluster road accident variables based on similarity23, enriching the dataset by generating an additional feature for training. The features are then classified into severity levels using Random Forest, which performs better than traditional models (SVM, KNN, and Logistic Regression). Elakiya et al.24 suggested a method for detecting behavioral patterns based on combining KNN with median filtering. Similarly, Chriki et al.25 presented a framework that combines deep features from GoogleNet with traditional hand-crafted descriptors, such as HOG and HOG3D, executed through a structured process from dataset collection to UAV surveillance.

Deep learning approach

Traditional methods are increasingly combined with deep learning-based methods. Shoaib et al.26 proposed a technique that improves the detection of important changes in motion regions of video frames using background removal and an attention mechanism, classifying events as normal or abnormal on the public UCF Crime dataset with 96.89% accuracy using a 3D CNN. While this strategy works, it has several problems: the approach is sensitive to the quality of the input data, as is common in real-world applications. In an effort to develop more effective and robust surveillance systems, two new models were recently suggested to optimize detection performance and maintain flexibility across different scenarios27. R. Nawaratne et al. also proposed a deep learning-based technique for real-time anomaly identification and localization in video surveillance28. Although the approach is novel, its middling detection accuracy of 85% suggests that performance could be enhanced. A probability model efficiently extracts the size, velocity, and position features of video frames by assigning weights to expected particles based on how likely they are to be associated with anomalies29. That framework achieves lower processing time and Equal Error Rate (EER) than current state-of-the-art algorithms in testing on the UCSD and LIVE datasets. The authors of study30 deployed CNNs first trained for object classification on a variety of tasks, including scene classification, visual instance retrieval, and attribute detection. In another work31, a 2D CNN previously trained on image classification datasets was modified to extract features from various regions of input images. In a similar manner, U. Arul et al.32 present an adaptive recurrent neural network designed for detecting anomalies in video images, merging a recurrent neural network with a crystal structure algorithm.
In that method, the background is removed after preprocessing, which makes it easier to identify the relevant frames. Long Short-Term Memory (LSTM) networks are frequently used for identifying departures from typical patterns in various fields due to their ability to process time-series data robustly33. F. Ding et al.34 note that LSTMs are also effective for detecting dynamic anomalies thanks to their ability to model temporal and contextual dependencies. LSTM networks are suitable for trajectory-based anomaly detection, as they have been successful in modeling and predicting object trajectories. Studies using data on container movement have demonstrated the ability of LSTM autoencoders to detect navigational irregularities, which are crucial to maritime safety. Another approach uses LSTMs to predict normal trajectory patterns, with deviations constituting anomalies. However, complex backgrounds with many objects or people often pose challenges for LSTM-based approaches. Motion-related anomalies have been detected using a one-class SVM (OCSVM): connected component analysis was employed to reduce false positives from unexpected motion in low-motion regions, and AUC scores of 76.08 (pixel level) and 97.53 (frame level) were achieved. Furthermore, the authors35 argued that compactness and feature representation are critical in image classification and feature learning, as they improve efficiency and streamline the learning process. To address this, S. Lei et al.36 introduced a multi-scale feature extraction framework that captures video data at various resolutions, added a Spatial Pyramid Convolution (SPC) module to enhance object recognition at different scales, and added a Weakly Supervised Data Augmentation Network (WSDAN) to improve input images through guided augmentation; the augmented images were then processed by a U-Net model, reaching AUC scores of 86.2% and 97.9% on the tested datasets.
Nevertheless, it remains an expensive process with many influencing factors and is limited by the perception of the images. YOLO (You Only Look Once) is a popular real-time object detection method because it classifies and localizes in a single forward pass.

YOLO-based detectors have been used for anomaly detection in traffic surveillance as well as in security applications. Building on this, to avoid escalation and mitigate possible losses, Ganagavalli et al.37 implemented a video surveillance system for automated real-time crime detection and developed optimized algorithms for YOLO frameworks. This design enables the system to detect unlawful activities in real time, alerting both the public and the authorities. Reporting AUC results of 0.8299 across 14 crime categories and 0.91 for vandalism detection, the system outperformed existing frameworks in precision, F1 score, training loss, and testing loss. Feature extraction sometimes produces false positives by capturing poorly relevant information. Using self-attention, the Vision Transformer (ViT) detects objects by modeling global relations within an image. For example, a Transformer-based tailing detection model was suggested in38: the TSViT-B/512 variant surpassed the best-performing baseline CNN (63.54% accuracy), increasing recall from 91.92% with ResNet-101 to 98.05% while reaching 76.56% accuracy. However, the model still presents shortcomings which, in part due to the small sample size, may lead to generalization problems in practical settings. X. Yan et al.39 focused on separating foreground and background in images and built a model with two encoders: a motion encoder capturing the differences in a sequence of images as motion input, and another encoder that receives the last frame as a static image. These issues highlight the need for further model refinements aimed at improving accuracy and maintaining consistent performance across diverse video data of varying temporal lengths. The strengths and limitations of previous work are summarized in Table 1.

Table 1.

Summary of prior studies’ strengths and limitations.

Author, Year, Reference Strength Limitation
E. Mujkic et al. 202240 This proposed study makes use of convolutional autoencoders to detect objects that deviate from common patterns. An autoencoder network can be trained to reconstruct typical patterns in agricultural fields with high reconstruction errors in order to identify unknown objects. The PR-AUC was 0.93 using this method. Autoencoders were used in this model, and although they worked well, they were often lossy and required a lot of processing power. Inaccurate object detection was occasionally caused by the flawed decoding process.
J. W Lee et al. 202441 Achieves strong accuracy across different datasets, is adaptable for various settings, offers flexible model complexity, and provides a valuable benchmark for future studies. Resource-intensive for complex models, varying results on some datasets, limited testing under real-world conditions, and a lack of emphasis on real-time speed.
K. Razaee et al. 202442 In real-time surveillance, the technique effectively eliminates false positives and negatives with a high accuracy of 94.13% and excellent sensitivity and specificity. Robust anomaly detection is ensured by combining hand-crafted features with deep learning. The high computational resource requirements of this approach, such as high-performance GPUs, may limit scalability in resource-constrained environments. Large, labeled samples are necessary for training, and dynamic or obstructed conditions may affect tracking effectiveness.

Methodology

The proposed framework uses YOLOv8s to detect objects and draw their bounding boxes after video pre-processing isolates individual frames; the Mask Generator then creates the corresponding masks for those bounding boxes, which the image preprocessor separates from the background. This background subtraction reduces the effect of excessive noise and improves detection accuracy, whereas traditional detection methods fail due to background interference43. The proposed approach overcomes this limitation by using a CNN encoder to capture spatial features from individual frames, such as object size and location, and a Transformer decoder to model temporal dependencies across sequences of frames, fused into a single end-to-end framework, as presented in Fig. 1.

Fig. 1.


The proposed architecture performs road anomaly detection through a bounding-box-masks extractor.

Bounding box masks extractor

The YOLOv8 object detection method allows real-time processing of video frames and can handle complex, crowded environments44, making it well suited for object detection. The Bounding-Box-Masks Generator automatically extracts bounding box masks from objects detected with the YOLOv8 technique45; Fig. 2 shows its architecture. Videos are processed to obtain frames and resized to 800 × 600 pixels at 30 frames per second.

Fig. 2.


The structure of the Bounding-Box-Masks Generator.

The part of the video that contains the detected road anomaly is cut to form a 40-frame clip, which lasts for 2 s and retains the original resolution. Each frame in this clip is manually labeled to indicate the presence of a road anomaly. The normalized frames, labeled f_1, f_2, …, f_n, are fed into the YOLOv8s model within the Bounding Box Masks Generator to detect all moving objects in the images46. YOLOv8 utilizes bounding boxes of different sizes to identify objects of various scales and assigns a unique identity to each detected object. All detected bounding boxes in a single frame are labeled o_{n,1}, …, o_{n,j−1}, o_{n,j}, ensuring that each object in the input frame is represented by a single bounding box, with j indicating the jth bounding box. The identities of these bounding boxes, denoted i_{n,1}, …, i_{n,j−1}, i_{n,j}, are used to track the detected objects. A Mask Generator creates bounding box masks labeled m_{n,1}, …, m_{n,j−1}, m_{n,j}. The image preprocessor outputs a tuple of five values (x_{n,j}, y_{n,j}, w_{n,j}, h_{n,j}, c_{n,j}) representing one bounding box for every object. In this tuple, x_{n,j} and y_{n,j} are the coordinates of the center of the bounding box, while w_{n,j} and h_{n,j} indicate its width and height, respectively. The values of x_{n,j} and w_{n,j} range from 0 to 800 pixels, while y_{n,j} and h_{n,j} range from 0 to 600 pixels. The confidence score c_{n,j} indicates the likelihood of an object being present within the bounding box, with scores ranging from 0 to 1. To enable background subtraction, elements are classified into two categories: background and objects. Static components such as highways, buildings, and trees make up the background, while moving items such as cars, people, and animals make up the objects. Using manually defined bounding boxes, the image preprocessing phase establishes object coordinates. Any bounding box with a YOLOv8-calculated confidence score c_{n,j} less than 0.55 is initially ignored.
If an object is recognized in a frame without a bounding box, the bounding box from neighboring frames is used as a stand-in. Only the center coordinates x_{n,j}, y_{n,j} and dimensions w_{n,j}, h_{n,j} (width and height) of each bounding box are retained, while the confidence score is discarded.
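As an illustration, the confidence filtering and binary-mask construction described above can be sketched as follows (the function name is ours; with the real detector the boxes would come from YOLOv8 via the ultralytics package, as noted in the comment):

```python
import numpy as np

def boxes_to_masks(boxes, scores, frame_shape=(600, 800), conf_thresh=0.55):
    """Turn (x_center, y_center, w, h) detections into binary bounding-box
    masks, discarding boxes whose confidence score is below the threshold.
    Object pixels are set to 1 (white); everything else stays 0 (black)."""
    masks = []
    height, width = frame_shape
    for (x, y, w, h), c in zip(boxes, scores):
        if c < conf_thresh:          # ignore low-confidence detections
            continue
        mask = np.zeros((height, width), dtype=np.uint8)
        x0, y0 = max(int(x - w / 2), 0), max(int(y - h / 2), 0)
        x1, y1 = min(int(x + w / 2), width), min(int(y + h / 2), height)
        mask[y0:y1, x0:x1] = 1
        masks.append(mask)
    return masks

# With the real detector, the boxes would come from YOLOv8, e.g.:
#   from ultralytics import YOLO
#   result = YOLO("yolov8s.pt")(frame)[0]
#   masks = boxes_to_masks(result.boxes.xywh.tolist(),
#                          result.boxes.conf.tolist())
```

Only the retained boxes produce masks, so a frame with one confident detection and one low-confidence detection yields a single mask.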

Figure 3 provides an example of a bounding box mask for an 800 × 600 video frame, where a detected object occupies a region of 85 × 240 pixels (length × width)47. Within this area, object pixels are marked in white, and the remaining pixels are blacked out. After background subtraction, each detected object is represented by a rectangular mask. These saved bounding box masks m_{n,1}, …, m_{n,j−1}, m_{n,j} are then fed into the road anomaly detection system. The YOLOv8 model demonstrates high efficiency in locating road anomalies in video frames, supporting the robust detection step necessary for successful background subtraction. Also, the multi-head attention mechanism in the transformer decoder improves performance, especially in cases involving multiple object interactions (Fig. 4). This flexibility demonstrates the framework’s capability to detect road anomalies under various lighting conditions. The diversity of the dataset provided a comprehensive and challenging range of scenarios for the evaluation of road anomaly detection48. The framework was able to detect road anomalies in frames from a variety of environments and conditions, indicating its robustness.

Fig. 3.


The sample images of the bounding box mask after background subtraction.

Fig. 4.


The CNN encoder architecture.

Road anomaly detector

The proposed model’s CNN encoder is composed of a series of convolutional and max-pooling layers that process the 224 × 224 pixel input bounding box masks m_{n,j}. The convolutional layers employ 5 × 5 filters with 1-pixel padding to preserve the spatial resolution of the feature maps, followed by max-pooling layers with a 3 × 3 filter and a stride of 2 to reduce spatial dimensions and highlight important features. In this architecture, each pair of convolutional layers is followed by a max-pooling layer that decreases the feature map dimensions while preserving critical information. This configuration surpasses the typical CNN structure by stacking several convolutional layers before each pooling operation, improving its capacity to collect spatial data49 from each bounding box mask. The CNN encoder extracts spatial features such as object location and size, gradually decreasing the feature map sizes from 115 × 115 to 55 × 55 and finally to 25 × 25. Once spatial features are extracted from all bounding box masks, they are sent to the transformer decoder, which analyzes temporal relationships across the bounding box masks to complete anomaly detection50. This iterative process of convolution and max pooling in the CNN encoder is expressed in Eq. 1.

h_{m_{n,j}} = M(F(w_k ∗ m_{n,j} + b_k))    (1)
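A minimal PyTorch sketch of the encoder described above (channel widths are our assumption; with these exact filter and pooling settings the final map comes out 23 × 23 rather than the reported 25 × 25, so some layer detail in the paper likely differs):

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Pairs of 5x5 convolutions (padding 1) each followed by
    3x3/stride-2 max pooling, as described in the text."""
    def __init__(self):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=5, padding=1), nn.ReLU(),
                nn.Conv2d(c_out, c_out, kernel_size=5, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=3, stride=2),  # shrink spatial dims
            )
        self.features = nn.Sequential(block(1, 16), block(16, 32), block(32, 64))

    def forward(self, x):
        return self.features(x)

enc = CNNEncoder()
out = enc(torch.zeros(1, 1, 224, 224))  # one 224x224 bounding-box mask
```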

Here, F represents the activation function, specifically the Rectified Linear Unit (ReLU), M the max-pooling operation, w_k the convolutional weight matrix, and b_k the bias vector. The process begins by applying the kth convolution (w_k ∗ m_{n,j} + b_k) to the input m_{n,j}. The result is first passed through the activation function F, and then max-pooling M is applied. This procedure continues across all layers until the last convolution and pooling operations are completed. As shown in Fig. 5, the transformer decoder takes the output of the CNN encoder as its input, generating the hidden features h_{m_{n,j}} of m_{n,j}. This decoder comprises a linear layer, a multi-head attention mechanism, and a feed-forward network. The transformer decoder in this study includes modifications that distinguish it from the original Transformer model, whose architecture consists of both an encoder and a decoder; here, two extra linear layers are incorporated within the decoder51. The first linear layer projects the hidden input features into a higher-dimensional space, adjusting them to 255 units for processing in the next layers. The final linear layer, made up of two neurons, serves as the output stage: it transforms the hidden features into a probability distribution across the target classes, providing classification results for multiple video frames of traffic incidents denoted r_1, …, r_{n−1}, r_n. The output of each frame is passed through a softmax layer, which limits the values to between 0 and 1; a value close to 1 indicates a high probability of an anomaly in that frame, and a value close to 0 indicates a low probability. The multi-head attention mechanism performs parallel processing to derive temporal characteristics from hidden input representations, allowing the model to attend to information at various positions and capture a wide range of temporal dependencies. As stated in Eq. 2, the transformer decoder uses a multi-head attention mechanism with eight heads, each with a hidden dimension of 32.

Fig. 5.


The Transformer Decoder architecture.

Attention(h^q_{m_{n,j}}, h^k_{m_{n,j}}, h^v_{m_{n,j}}) = softmax( h^q_{m_{n,j}} (h^k_{m_{n,j}})^T / √d_{h^k_{m_{n,j}}} ) h^v_{m_{n,j}}    (2)

Here, h^q_{m_{n,j}}, h^k_{m_{n,j}}, and h^v_{m_{n,j}} denote the query, key, and value matrices, which are embedded from the hidden input features. The softmax function is applied to the scaled dot product between the query h^q_{m_{n,j}} and the key matrix h^k_{m_{n,j}}, and the result is then multiplied by the value matrix h^v_{m_{n,j}}. The dot product is scaled by the square root of the key dimension d_{h^k_{m_{n,j}}}; T denotes the transpose. A normalization function and a residual connection follow both the multi-head attention mechanism and the feed-forward network to enhance learning stability and refine the hidden features weighted by attention. The feed-forward network consists of two linear transformations with a ReLU activation function.
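A minimal PyTorch sketch of the decoder components described above, with the paper's stated 8 heads of hidden dimension 32 (the total embedding width of 8 × 32 = 256 and the placement of the 255-unit linear layer inside the feed-forward network are our assumptions):

```python
import torch
import torch.nn as nn

embed_dim, num_heads, num_frames = 8 * 32, 8, 40  # 8 heads x 32 dims

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
norm1, norm2 = nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)
ffn = nn.Sequential(nn.Linear(embed_dim, 255), nn.ReLU(),
                    nn.Linear(255, embed_dim))
classifier = nn.Linear(embed_dim, 2)   # final two-neuron output layer

h = torch.randn(1, num_frames, embed_dim)   # hidden features h_{m_{n,j}}
a, _ = attn(h, h, h)                        # self-attention across frames
h = norm1(h + a)                            # residual + normalization
h = norm2(h + ffn(h))                       # feed-forward with residual
probs = classifier(h).softmax(dim=-1)       # per-frame anomaly probability
```

Each of the 40 frames receives a two-class probability; the anomaly class score being close to 1 flags that frame, matching the softmax interpretation above.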

InceptionV3

The Inception model is a type of CNN designed to improve deep neural networks by expanding their width and depth while efficiently utilizing computing resources. This is achieved by approximating a sparser network structure using dense matrix operations, which are well suited to contemporary hardware. The primary aim was to use readily available dense components as an approximation of an ideal local sparse design. In addition, the design applies projection and dimensionality reduction to keep computational costs under control when they would otherwise become excessive52. The architecture includes multiple Inception modules, each combining 1 × 1, 3 × 3, and 5 × 5 convolutional operations. The outputs from these different filters are concatenated into a single feature vector, which is then passed as input to the next layer. These modules are arranged in sequence, with max-pooling layers (stride of 2) occasionally inserted between them. The structure implemented in this study is shown in Fig. 6.
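The Inception module described above can be sketched in PyTorch as follows (channel counts are illustrative, not InceptionV3's actual configuration; 1 × 1 projections before the larger filters provide the dimensionality reduction mentioned in the text):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 convolutions (plus a pooled branch)
    whose outputs are concatenated along the channel axis."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 16, 1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 8, 1),
                                nn.Conv2d(8, 16, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, 8, 1),
                                nn.Conv2d(8, 16, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(c_in, 16, 1))

    def forward(self, x):
        # concatenate all branch outputs into one feature vector
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)],
                         dim=1)

out = InceptionModule(32)(torch.zeros(2, 32, 28, 28))  # -> (2, 64, 28, 28)
```

All branches preserve the spatial size, so only the channel dimension grows (here 4 × 16 = 64 channels).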

Fig. 6.


Basic architecture of Inceptionv3.

Experimental assessment and performance

This section explains the experimental setup and datasets that serve as the basis for our investigation. A comparison is then carried out between our suggested model and several cutting-edge anomaly detection techniques. Furthermore, we provide the outcomes of a methodical assessment of numerous network elements using ablation investigations.

RAD dataset

The RAD dataset includes various videos and images collected from different sources to capture a wide range of road abnormalities such as road accidents, vehicle fires, snatching, and fighting, as shown in Fig. 7. Mobile cameras and surveillance devices were used to capture photographs and videos in several Pakistani cities. This is the first openly available dataset of road anomalies from the South Asia region. The many examples in the collection vary in size, form, and environmental circumstances. This dataset is of great significance for the development of intelligent transportation and surveillance systems. Vehicle trajectory angles were recorded in the videos, displaying the routes and directions of the vehicles53. Table 2 offers a summary of the RAD dataset.

Fig. 7.


Sample images of the RAD dataset.

Table 2.

Key characteristics of RAD dataset Videos.

File name Video length Image (width × height) pixels @ frames Lighting Situation Class of Anomaly
RA05 8 s 640 × 520 @30 low Road accident
RA08 12 s 1280 × 720 @25 high Road accident
RA06 15 s 1280 × 720 @30 low Road accident
VF05 11 s 1920 × 1080 @24 low Car fire
VF08 17 s 1280 × 720 @24 high Car fire
Fi09 16 s 640 × 480 @30 high Fighting
Fi10 18 s 640 × 480 @25 low Fighting
Sn07 22 s 1920 × 1080 @25 low Snatching
Sn10 19 s 640 × 480 @30 high Snatching

UCF dataset

The University of Central Florida (UCF) dataset is used to measure how well the suggested approach works. Figure 8 shows 13 distinct sorts of real-world abnormal occurrences54, such as abuse, arrest, arson, assault, accidents, burglary, explosions, fighting, robbery, shooting, and vandalism. It is a realistic and extensive dataset. The fact that many of the videos are long and contain multiple scenes makes it difficult to precisely detect and identify anomalous activity.

Fig. 8.


Images of the UCF dataset.

Model performance

Model performance is evaluated using several standard metrics, including the Receiver Operating Characteristic (ROC) curve, confusion matrix, accuracy, recall, precision, specificity, and F1-score. In anomaly detection (AD), the goal is to maximize true positives (TP) and true negatives (TN) while minimizing false positives (FP) and false negatives (FN). Here, TN refers to correctly identified normal instances, and TP indicates accurately detected anomalies. In contrast, FP arises when normal cases are misclassified as anomalies, and FN occurs when anomalies are incorrectly identified as normal. The mathematical definitions of these metrics are provided in Eqs. (3)–(7).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)
Precision = TP / (TP + FP)    (4)
Recall = TP / (TP + FN)    (5)
Specificity = TN / (TN + FP)    (6)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (7)
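For concreteness, these standard metrics can be computed directly from confusion-matrix counts (the function name and example counts are illustrative):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, specificity, and F1-score
    from true/false positive and negative counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f1

# e.g. 90 anomalies caught, 85 normals kept, 15 false alarms, 10 misses:
metrics = classification_metrics(90, 85, 15, 10)
```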

Experimental setup

The experiments in this study were carried out on a Windows-based system with a 12 GB NVIDIA RTX 3090Ti GPU, 16 GB of RAM, and a 10th-generation Intel Core i7 CPU. CUDA 12.3.x was used to optimize GPU processing, while PyTorch 1.10.0 and Python 3.6 were used to build the deep learning models. The UCF dataset and the Road Anomaly Dataset (RAD), both publicly available, were used to evaluate the proposed model. To keep the data uniform and easier to handle, just 30 frames, evenly distributed across each video, were chosen to conserve memory; classification accuracy was unaffected. The model’s input data consisted of bounding box masks downsized to 255 × 255 pixels. The dataset was artificially expanded in size and diversity using several techniques, such as flipping, zooming, and rotating, to help the model learn to identify various abnormalities. Additionally, the pixel values in each frame were normalized by dividing them by 255, scaling them to the range 0 to 1 and preparing the model for more effective training. The dataset was split into 70% for training, 20% for validation, and 10% for testing. The model was trained for 50 epochs with a batch size of 64 and an initial learning rate of 1 × 10⁻⁴. Other important hyperparameters were a learning rate that decreased over time using a cosine decay schedule and a dropout rate of 0.2 to prevent overfitting.
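The frame sampling and normalization steps above can be sketched as follows (the function name and plain-NumPy formulation are ours; resizing to square masks is assumed to happen in the mask extractor):

```python
import numpy as np

def sample_and_normalize(video_frames, num_frames=30):
    """Pick `num_frames` frames evenly spaced across the video and scale
    pixel values from [0, 255] to [0, 1], as in the preprocessing above."""
    idx = np.linspace(0, len(video_frames) - 1, num_frames).astype(int)
    frames = np.stack([video_frames[i] for i in idx]).astype(np.float32)
    return frames / 255.0          # normalize to the range [0, 1]

# a dummy 300-frame video of 255x255 grayscale masks:
video = [np.full((255, 255), 128, dtype=np.uint8) for _ in range(300)]
clip = sample_and_normalize(video)
```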

The model used a softmax output with the Adaptive Moment Estimation (Adam) algorithm as its optimizer. We adjusted the design during the design phase, and after testing on real-world data with other parameter settings, we found that this parameter arrangement (3 × 3 max-pooling layer, stride of 2) produced the best results for anomaly detection in surveillance videos. While 25 × 25 feature maps are not high-dimensional compared to larger feature maps, they strike a good balance between feature extraction and processing efficiency. By employing max pooling with a 3 × 3 filter and a stride of 2, the model can focus on the most crucial features while boosting processing speed and performance, effectively shrinking spatial dimensions while keeping important components.

Data augmentation

By creating transformed copies of the original data, image augmentation eliminates the requirement for further real-world data collection. Common transformations include translation, flipping, and rotation, as well as changes to color characteristics such as saturation, contrast, and brightness, which introduce variability into the data and may enable the model to learn more attributes. We augmented the data with flipping, rotation, and translation, as shown in Fig. 9. By combining these augmentation methods, we hoped to enhance the model’s ability to generalize across contexts.
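A minimal NumPy sketch of the three augmentations used here (the specific shift and rotation amounts are illustrative, and the wrap-around translation via `np.roll` is a simplification of a true translation with padding):

```python
import numpy as np

def augment(image):
    """Generate flipped, rotated, and translated copies of a frame --
    the three augmentations applied in this study."""
    flipped = np.fliplr(image)                     # horizontal flip
    rotated = np.rot90(image)                      # 90-degree rotation
    translated = np.roll(image, shift=10, axis=1)  # shift 10 px right (wraps)
    return [flipped, rotated, translated]
```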

Fig. 9.

The sample images with the applied augmentation approach.

Results

UCF dataset results

The results for the four main categories (4MajCat) of the UCF dataset for the different deep learning models are presented in Table 3. The best model was InceptionV3, with an accuracy, precision, recall, F1-score, and specificity of 86.61%, 86.60%, 86.61%, 86.61%, and 86.61%, respectively. The performance evaluation of the proposed model on the same four categories is given in Table 4: the average accuracy, precision, recall, F1-score, and specificity were 89.90%, 89.91%, 89.91%, 89.90%, and 89.90%, respectively, validating that the proposed framework is efficient and robust in detecting abnormal events such as collisions, explosions, altercations, and theft.

Table 3.

Average performance results of different deep learning models on the four main categories of the UCF dataset.

Model Accuracy(%) Precision(%) Recall (%) F1-Score(%) Specificity(%)
VGG19 76.50 76.50 76.51 76.50 76.51
ResNet50 78.68 78.69 78.68 78.68 78.69
MobileNetV2 82.44 82.45 82.45 82.44 82.44
InceptionV3 86.61 86.60 86.61 86.61 86.61
DenseNet201 84.85 84.85 84.85 84.85 84.85
Table 4.

The performance evaluation of the proposed model using four major categories on the UCF dataset.

Class Accuracy(%) Precision(%) Recall (%) F1-Score(%) Specificity(%)
Accident 89.91 89.91 89.90 89.90 89.90
Explosion 89.90 89.91 89.91 89.90 89.91
Fighting 89.90 89.90 89.91 89.90 89.90
Stealing 89.90 89.91 89.91 89.90 89.90
Average 89.90 89.91 89.91 89.90 89.90
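The per-class metrics reported in Tables 3 and 4 can all be derived from a confusion matrix in one-vs-rest fashion. The sketch below is illustrative (the helper name `per_class_metrics` is hypothetical), not the evaluation code used in the study:

```python
def per_class_metrics(cm, cls):
    """One-vs-rest metrics for class `cls` from a square confusion matrix
    cm[true][pred] of raw counts."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[cls][cls]                                 # correctly predicted cls
    fp = sum(cm[r][cls] for r in range(n)) - tp       # others predicted as cls
    fn = sum(cm[cls][c] for c in range(n)) - tp       # cls predicted as others
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }
```

Averaging these dictionaries over the four classes yields the macro-averaged rows shown in the tables.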

The proposed approach is compared with other state-of-the-art methods in Table 6, and a more comprehensive evaluation of its performance on the UCF dataset is shown in Fig. 10. Figure 10(a) shows that the validation loss starts at 0.56 and steadily decreases to 0.26 after about 50 epochs. Figure 10(b) shows that the validation accuracy starts at 0.65 and increases to 0.89, peaking near the 50th epoch. Figure 10(c) presents the confusion matrix, which reports the accuracy for each class and confirms the classification performance of the model, and Fig. 10(d) shows the ROC curve, which relates the true-positive and false-positive rates and is used to evaluate the predictive ability of the model. Together, these figures provide a comprehensive analysis of the model's performance.

Table 6.

Comparison of the proposed model's AUC with prior studies.

Authors, Reference Model AUC(%)
A. O. Tur et al.63 K-diffusion 65.20
K. Simonyan et al.64 VGG-16 72.65
K. Biradar et al.65 DEARESt 76.65
J. X. Zhong et al.66 TSN-optical flow 78.09
Y. Tian et al.67 RTFM 84.31
Proposed Model CNN Encoder + Transformer Decoder 90.50
Fig. 10.

Fig. 10

(a). Accuracy Curve (b). Loss Curve (c). Confusion Matrix (d). ROC Curve.

Fig. 11.

Fig. 11

(a). Accuracy Curve (b). Loss Curve (c). Confusion Matrix (d). ROC Curve.

As shown in Table 5, the proposed model achieves the highest accuracy, 89.90%, on the UCF dataset. As indicated in Table 6, the proposed approach outperforms the previously reported approaches with an AUC of 90.50%. These results demonstrate the superior performance and reliability of the proposed approach in comparison to earlier research.
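The AUC values in Table 6 summarize the ROC curve in a single number. As a minimal illustration (not the evaluation code used in the study), the AUC can be computed as the probability that a randomly chosen positive sample receives a higher anomaly score than a randomly chosen negative one, with ties counted as half:

```python
def roc_auc(labels, scores):
    """AUC via the rank statistic: the fraction of positive/negative pairs
    in which the positive sample is scored higher (ties count 0.5).
    `labels` are 0/1 ground truths; `scores` are model outputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise formulation is mathematically equivalent to integrating the ROC curve, which is why an AUC of 90.50% means the model ranks an anomalous frame above a normal one about 90.5% of the time.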

Table 5.

Comparison of the accuracy of the proposed model with earlier state-of-the-art models on the UCF dataset.

Authors, Reference Model Accuracy(%)
Sudhakaran et al.55 Convolutional LSTM 77.01
B. Cheng et al.56 Flow Gated Network 87.26
Y. Su et al.57 SPIL Convolutional Network 89.32
Z. Islam et al.58 SepConvLSTM-M 89.70
G. A. Martínez et al.59 3D CNN 75.71
A. Ansari et al.60 InceptionV3 + LSTM 74.53
W. Ullah et al.61 CNN + LSTM 78.43
I. Muneer et al.62 InceptionV3 + BiLSTM 81.01
Proposed Model CNN Encoder + Transformer Decoder 89.90

RAD dataset results

Table 7 shows the performance of the deep learning models on the RAD dataset, and Table 8 shows the proposed model's performance on the same dataset. The proposed model delivers better results in all classes, with accuracy, precision, recall, F1-score, and specificity ranging from 98.28% to 98.29%, a significant improvement over the other deep learning methods. A more thorough evaluation on the RAD dataset is presented in Fig. 11: the validation loss (Fig. 11a) begins at 0.4 and steadily decreases to its minimum at around 50 epochs, while the validation accuracy (Fig. 11b) begins at 0.85, rises quickly to its peak around 50 epochs, and gradually reaches 0.98. The confusion matrix (Fig. 11c) shows the classification accuracy for each class and confirms the model's robustness, and the ROC (AUC) curve (Fig. 11d) shows the trade-off between true positives and false positives, reflecting the effectiveness of the model. Figure 12 presents the test results of the proposed model.

Table 7.

The RAD dataset was used to evaluate the average performance of different deep learning models.

Models Accuracy(%) Precision(%) Recall (%) F1-Score(%) Specificity(%)
VGG19 78.20 78.20 78.21 78.21 78.21
ResNet50 82.14 82.14 82.13 82.14 82.14
MobileNetV2 89.36 89.35 89.36 89.36 89.35
InceptionV3 90.54 90.55 90.55 90.54 90.54
DenseNet201 89.60 89.60 89.61 89.60 89.61
Table 8.

The performance evaluation of the proposed model on the RAD dataset.

Class Accuracy(%) Precision(%) Recall (%) F1-Score(%) Specificity(%)
Accident 98.28 98.28 98.28 98.29 98.28
Car Fire 98.29 98.29 98.29 98.28 98.29
Fighting 98.28 98.28 98.29 98.29 98.28
Snatching(gunpoint) 98.28 98.29 98.28 98.28 98.28
Average 98.28 98.28 98.28 98.29 98.28
Fig. 12.

Fig. 12

Test results of the proposed model.

Conclusion and future work

An innovative approach for detecting road anomalies is presented in this study, consisting of two main stages: bounding-box mask extraction and anomaly detection. In the first stage, the YOLOv8 model separates objects from the background using bounding-box coordinates to create focused bounding-box masks, which enhance object clarity and reduce background noise. In the anomaly-detection stage, a CNN encoder extracts spatial features from these masks via convolutional and max-pooling operations, allowing the Transformer decoder to determine whether an anomaly has taken place within each frame. To quantify the advantage of background removal, the accuracy of the framework was measured with and without this functionality; the results indicate that background removal significantly improved the accuracy and robustness of the model. The system achieved an accuracy of 98.28% on the Road Anomaly Dataset (RAD) and 89.90% on the UCF Crime dataset, outperforming previously reported methods. To further improve the effectiveness of real-time detection and response, future work will concentrate on strengthening the robustness of the model by thoroughly testing its performance under different environmental conditions, such as rain, fog, and snow, to ensure accurate anomaly detection and a prompt response in real-time, dynamic scenarios.
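The bounding-box masking idea summarized above can be illustrated with a short sketch. This is not the authors' implementation: in the actual pipeline YOLOv8 supplies the box coordinates, whereas here the boxes and the helper name `bounding_box_mask` are hypothetical, and the "image" is a simple 2-D grid.

```python
def bounding_box_mask(frame, boxes, fill=0):
    """Keep pixels inside the detected bounding boxes; blank the background.
    `frame` is a 2-D grid (list of rows); each box is (x1, y1, x2, y2)
    with exclusive upper bounds."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]  # start from an all-background mask
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                out[y][x] = frame[y][x]   # copy only in-box pixels
    return out
```

Suppressing everything outside the detected boxes is what removes irrelevant background before the masked frames reach the CNN encoder.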

Acknowledgements

This work was supported by the Ongoing Research Funding program (ORF-2025-893), King Saud University, Riyadh, Saudi Arabia.

Author contributions

Conceptualization, SN, and FF; methodology, SA, and MM; software, SN, and MS; validation, MM, and AS; formal analysis, MM, and AS; investigation, AS and MM; resources, SN and FF data curation, MS; writing original draft preparation, SN, and FF; writing—review and editing, MS, AS, and MM; visualization, MS, and SN; supervision, FF, SN, and AS.; project administration, MM, and FF; funding acquisition, SA, and MS.

Data availability

The datasets are publicly available: the UCF Crime dataset (https://www.kaggle.com/datasets/odins0n/ucf-crime-dataset) and the Road Anomaly Dataset (RAD) (https://data.mendeley.com/datasets/8chk8vdn2z/1). (Accessed on 12 January 2025).

Declarations

Conflict of interest

The authors declare no conflict of interest.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Sarfaraz Abdul Sattar Natha, Email: sasattar@ssuet.edu.pk.

Mohammad Siraj, Email: siraj@ksu.edu.sa.

References

  • 1. Elmetwally, A., Eldeeb, R. & Elmougy, S. Deep learning based anomaly detection in real-time video. Multimed. Tools Appl. 10.1007/s11042-024-19116-9 (2024).
  • 2. Omarov, B., Narynov, S., Zhumanov, Z., Gumar, A. & Khassanova, M. State-of-the-art violence detection techniques in video surveillance security systems: A systematic review. PeerJ Comput. Sci. 8, e920. 10.7717/peerj-cs.920 (2022).
  • 3. Natha, S. A. A systematic review of anomaly detection using machine and deep learning techniques. Quaid-E-Awam Univ. Res. J. Eng. Sci. Technol. 20, 83–94. 10.52584/QRJ.2001.11 (2022).
  • 4. Ul Amin et al. Video anomaly detection utilizing efficient spatiotemporal feature fusion with 3D convolutions and long short-term memory modules. Adv. Intell. Syst. 6, 2300706. 10.1002/aisy.202300706 (2024).
  • 5. Hodge, V. & Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126. 10.1023/B:AIRE.0000045502.10941.a9 (2004).
  • 6. Amin, S. et al. Enhanced anomaly detection in pandemic surveillance videos: An attention approach with EfficientNet-B0 and CBAM integration. IEEE Access. 10.1109/ACCESS.2024.3488797 (2024).
  • 7. Chandola, V., Banerjee, A. & Kumar, V. Anomaly detection for discrete sequences: A survey. IEEE Trans. Knowl. Data Eng. 24, 823–839. 10.1109/TKDE.2010.235 (2012).
  • 8. Mienye, I. D. & Jere, N. Deep learning for credit card fraud detection: A review of algorithms, challenges, and solutions. IEEE Access 12, 96893–96910. 10.1109/ACCESS.2024.3426955 (2024).
  • 9. Hu, X., Tang, T., Tan, L. & Zhang, H. Fault detection for point machines: A review, challenges, and perspectives. Actuators 12, 391. 10.3390/act12100391 (2023).
  • 10. Natha, S. et al. Automated brain tumor identification in biomedical radiology images: A multi-model ensemble deep learning approach. Appl. Sci. 14, 2210. 10.3390/app14052210 (2024).
  • 11. Srivastava, A. & Bharti, M. R. Hybrid machine learning model for anomaly detection in unlabelled data of wireless sensor networks. Wirel. Pers. Commun. 129, 2693–2710. 10.1007/s11277-023-10253-2 (2023).
  • 12. Jokhio, F. A. et al. Scalable and generalized deep ensemble model for road anomaly detection in surveillance videos. Comput. Mater. Contin. 81, 3707–3729. 10.32604/cmc.2024.057684 (2024).
  • 13. Zhang, M., Li, T., Shi, H., Li, Y. & Hui, P. A decomposition approach for urban anomaly detection across spatiotemporal data. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence; Macao, China; pp. 6043–6049 (2019).
  • 14. Castro, P. S., Zhang, D., Chen, C., Li, S. & Pan, G. From taxi GPS traces to social and community dynamics: A survey. ACM Comput. Surv. 46, 1–34. 10.1145/2543581.2543584 (2013).
  • 15. Zhang, M. et al. Urban anomaly analytics: Description, detection, and prediction. IEEE Trans. Big Data 8, 809–826. 10.1109/TBDATA.2020.2991008 (2022).
  • 16. Zheng, Y., Zhang, H. & Yu, Y. Detecting collective anomalies from multiple spatio-temporal datasets across different domains. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems; Seattle, Washington; pp. 1–10 (2015).
  • 17. Kriegel, H. P., Kröger, P. & Zimek, A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 3, 1–58. 10.1145/1497577.1497578 (2009).
  • 18. Goldstein, M. & Uchida, S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLOS ONE 11, e0152173. 10.1371/journal.pone.0152173 (2016).
  • 19. Lavin, A. & Ahmad, S. Evaluating real-time anomaly detection algorithms – the Numenta Anomaly Benchmark. In Proceedings of the 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA); Miami, FL, USA; pp. 38–44 (2015).
  • 20. Erhan, L. et al. Smart anomaly detection in sensor systems: A multi-perspective review. Inf. Fusion 67, 64–79. 10.1016/j.inffus.2020.10.001 (2021).
  • 21. Karri, S. L., De Silva, L. C., Lai, D. T. C. & Yong, S. Y. Classification and prediction of driving behaviour at a traffic intersection using SVM and KNN. SN Comput. Sci. 2, 209. 10.1007/s42979-021-00588-7 (2021).
  • 22. Santos, D., Saias, J., Quaresma, P. & Nogueira, V. B. Machine learning approaches to traffic accident analysis and hotspot prediction. Computers 10, 157. 10.3390/computers10120157 (2021).
  • 23. Yassin, S. S. & Pooja. Road accident prediction and model interpretation using a hybrid K-means and random forest algorithm approach. SN Appl. Sci. 2, 1576. 10.1007/s42452-020-3125-1 (2020).
  • 24. Elakiya, V., Aruna, P. & Puviarasan, N. Mosaicking based optimal threshold image enhancement for violence detection with deep quadratic attention mechanism. J. Big Data 11, 147. 10.1186/s40537-024-00984-9 (2024).
  • 25. Chriki, A., Touati, H., Snoussi, H. & Kamoun, F. Deep learning and handcrafted features for one-class anomaly detection in UAV video. Multimed. Tools Appl. 80, 2599–2620. 10.1007/s11042-020-09774-w (2021).
  • 26. Shoaib, M. et al. Deep learning-assisted visual attention mechanism for anomaly detection in videos. Multimed. Tools Appl. 83, 73363–73390. 10.1007/s11042-023-17770-z (2023).
  • 27. Yao, X., Li, R., Qian, Z., Wang, L. & Zhang, C. Hierarchical Gaussian mixture normalizing flow modeling for unified anomaly detection (2024).
  • 28. Nawaratne, R. Spatiotemporal anomaly detection using deep learning for real-time video surveillance. IEEE Trans. Industr. Inf. 16, 393–402 (2019).
  • 29. Tariq, A. U. R., Farooq, S., Jaleel, H. & Wasif, A. Anomaly detection with particle filtering for online video surveillance. IEEE Access 9, 19457–19468. 10.1109/ACCESS.2021.3054040 (2021).
  • 30. Wasim, M. et al. Content oriented 3D-CNN sequence learning architecture for academic activities recognition using a realistic CAD dataset. Sci. Rep. 15, 25250. 10.1038/s41598-025-07620-3 (2025).
  • 31. Bouindour, S., Hittawe, M. M., Mahfouz, S. & Snoussi, H. Abnormal event detection using convolutional neural networks and 1-class SVM classifier. In Proceedings of the 8th International Conference on Imaging for Crime Detection and Prevention (ICDP 2017); Madrid, Spain; pp. 1–6 (2017).
  • 32. Arul, U. et al. Effective anomaly identification in surveillance videos based on adaptive recurrent neural network. J. Electr. Eng. Technol. 19, 1793–1805. 10.1007/s42835-023-01630-9 (2024).
  • 33. Guan, Y., Hu, W. & Hu, X. Abnormal behavior recognition using 3D-CNN combined with LSTM. Multimed. Tools Appl. 80, 18787–18801. 10.1007/s11042-021-10667-9 (2021).
  • 34. Ding, F. et al. Int. J. Robot. Autom. 33. 10.2316/Journal.206.2018.5.206-0061 (2018).
  • 35. Perera, P. & Patel, V. M. Learning deep features for one-class classification. IEEE Trans. Image Process. 28, 5450–5463. 10.1109/TIP.2019.2917862 (2019).
  • 36. Lei, S., Song, J., Wang, T., Wang, F. & Yan, Z. Attention U-Net based on multi-scale feature extraction and WSDAN data augmentation for video anomaly detection. Multimed. Syst. 30, 118. 10.1007/s00530-024-01320-0 (2024).
  • 37. Ganagavalli, K. & Santhi, V. YOLO-based anomaly activity detection system for human behavior analysis and crime mitigation. Signal Image Video Process. 18, 417–427. 10.1007/s11760-024-03164-7 (2024).
  • 38. Lee, J., Lee, S., Cho, W., Siddiqui, Z. A. & Park, U. Vision transformer-based tailing detection in videos. Appl. Sci. 11, 11591. 10.3390/app112411591 (2021).
  • 39. Yan, X., Yang, J., Sohn, K. & Lee, H. Attribute2Image: Conditional image generation from visual attributes. In Computer Vision – ECCV 2016; Lecture Notes in Computer Science, Vol. 9908; Springer: Cham; pp. 776–791 (2016).
  • 40. Mujkic, E., Philipsen, M. P., Moeslund, T. B., Christiansen, M. P. & Ravn, O. Anomaly detection for agricultural vehicles using autoencoders. Sensors 22, 3608. 10.3390/s22103608 (2022).
  • 41. Lee, J. W. & Kang, H. S. Three-stage deep learning framework for video surveillance. Appl. Sci. 14, 408. 10.3390/app14010408 (2024).
  • 42. Rezaee, K., Rezakhani, S. M., Khosravi, M. R. & Moghimi, M. K. A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance. Pers. Ubiquitous Comput. 28, 135–151. 10.1007/s00779-021-01586-5 (2024).
  • 43. Hong, Z. A preliminary study on artificial neural network. In Proceedings of the 2011 6th IEEE Joint International Information Technology and Artificial Intelligence Conference; Chongqing, China; pp. 336–338 (2011).
  • 44. Wang, X., Zhao, L. & Wang, S. A novel SVM video object extraction technology. In Proceedings of the 2012 8th International Conference on Natural Computation; Chongqing, Sichuan, China; pp. 44–48 (2012).
  • 45. Peretz, O., Koren, M. & Koren, O. Naive Bayes classifier – an ensemble procedure for recall and precision enrichment. Eng. Appl. Artif. Intell. 136, 108972. 10.1016/j.engappai.2024.108972 (2024).
  • 46. He, W., Wang, J., Wang, L., Pan, R. & Gao, W. A semantic segmentation algorithm for fashion images based on modified mask RCNN. Multimed. Tools Appl. 82, 28427–28444. 10.1007/s11042-023-14958-1 (2023).
  • 47. Xue, C., Lin, B., Zheng, J., Li, J. & Feng, Q. Robust correlation tracking with closed-loop feedback control. Multimed. Syst. 31, 209. 10.1007/s00530-025-01816-3 (2025).
  • 48. Kang, K. et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 28, 2896–2907. 10.1109/TCSVT.2017.2736553 (2018).
  • 49. Shah, S. & Tembhurne, J. Object detection using convolutional neural networks and transformer-based models: A review. J. Electr. Syst. Inf. Technol. 10, 54. 10.1186/s43067-023-00123-z (2023).
  • 50. Nogueira, A. F. R., Oliveira, H. S., Machado, J. J. M. & Tavares, J. M. R. S. Transformers for urban sound classification – a comprehensive performance evaluation. Sensors 22, 8874. 10.3390/s22228874 (2022).
  • 51. Nimma, D. et al. Object detection in real-time video surveillance using attention based transformer-YOLOv8 model. Alex. Eng. J. 118, 482–495. 10.1016/j.aej.2025.01.032 (2025).
  • 52. Natha, S. et al. Improving traffic surveillance: Deep learning approach for road anomaly detection in videos. In Proceedings of the 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI); Mt Pleasant, MI, USA; pp. 1–7 (2024).
  • 53. Vosta, S. & Yow, K. C. A CNN-RNN combined structure for real-world violence detection in surveillance cameras. Appl. Sci. 12, 1021. 10.3390/app12031021 (2022).
  • 54. Londoño Lopera, J. C., Bolaños Martinez, F. & Fletscher Bocanegra, L. A. Building a custom crime detection dataset and implementing a 3D convolutional neural network for video analysis. Algorithms 18, 103. 10.3390/a18020103 (2025).
  • 55. Sudhakaran, S. & Lanz, O. Learning to detect violent videos using convolutional long short-term memory. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS); Lecce, Italy; pp. 1–6 (2017).
  • 56. Chen, B. et al. Spatiotemporal convolutional neural network with convolutional block attention module for micro-expression recognition. Information 11, 380. 10.3390/info11080380 (2020).
  • 57. Su, Y., Lin, G., Zhu, J. & Wu, Q. Human interaction learning on 3D skeleton point clouds for video violence recognition. In Computer Vision – ECCV 2020; Lecture Notes in Computer Science, Vol. 12349; Springer: Cham; pp. 74–90 (2020).
  • 58. Islam, Z., Rukonuzzaman, M., Ahmed, R., Kabir, M. H. & Farazi, M. Efficient two-stream network for violence detection using separable convolutional LSTM. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN); Shenzhen, China; pp. 1–8 (2021).
  • 59. Martínez-Mascorro, G. A., Abreu-Pederzini, J. R., Ortiz-Bayliss, J. C., Garcia-Collantes, A. & Terashima-Marín, H. Criminal intention detection at early stages of shoplifting cases by using 3D convolutional neural networks. Computation 9. 10.3390/computation9020024 (2021).
  • 60. Ansari, M. A. & Singh, D. K. An expert eye for identifying shoplifters in mega stores. In International Conference on Innovative Computing and Communications; Advances in Intelligent Systems and Computing, Vol. 1394; Springer: Singapore; pp. 107–115 (2022).
  • 61. Ullah, W., Ullah, A., Hussain, T., Khan, Z. A. & Baik, S. W. An efficient anomaly recognition framework using an attention residual LSTM in surveillance videos. Sensors 21, 2811. 10.3390/s21082811 (2021).
  • 62. Muneer, I., Saddique, M., Habib, Z. & Mohamed, H. G. Shoplifting detection using hybrid neural network CNN-BiLSMT and development of benchmark dataset. Appl. Sci. 13, 8341. 10.3390/app13148341 (2023).
  • 63. Tur, A. O., Dall'Asen, N., Beyan, C. & Ricci, E. Exploring diffusion models for unsupervised video anomaly detection. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP); Kuala Lumpur, Malaysia; pp. 2540–2544 (2023).
  • 64. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition (2014).
  • 65. Biradar, K., Dube, S. & Vipparthi, S. K. DEARESt: Deep convolutional aberrant behavior detection in real-world scenarios. In Proceedings of the IEEE 13th International Conference on Industrial and Information Systems (ICIIS); Rupnagar, India; pp. 163–167 (2018).
  • 66. Zhong, J. X. et al. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
  • 67. Tian, Y. et al. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); Montreal, QC, Canada; pp. 4955–4966 (2021).


