Abstract
The exponential growth of video data from surveillance and online platforms has heightened the demand for intelligent, explainable systems capable of detecting violence in real time. This study proposes a novel Explainable Attention-Enhanced Convolutional Neural Network (CNN) framework that integrates unsupervised keyframe selection, attention-driven feature learning, and Grad-CAM++-based interpretability to address redundancy, transparency, and generalization challenges in video violence detection. The proposed model automatically extracts representative keyframes using similarity-based clustering, reducing computational overhead while retaining essential temporal information. Attention modules are embedded within the CNN backbone to enhance spatial–temporal feature discrimination, while Grad-CAM++ provides interpretable visual insights into the model’s decision process. Comprehensive experiments on five benchmark datasets—RLVS, Hockey Fight, Violent Flow, ShanghaiTech, and UCF-Crime—demonstrate that the framework achieves superior performance, with an average accuracy of 94.6% and F1-score of 93.9%, outperforming state-of-the-art models such as C3D, I3D, ResNet-LSTM, and ViViT. The model also delivers near-real-time efficiency (≈ 62 FPS) with reduced memory utilization (6.8 GB), confirming its suitability for deployment in surveillance and public safety systems. Statistical analysis using ANOVA and Tukey’s HSD tests verified that keyframe selection and attention modules significantly improve performance (p < 0.05) with large effect sizes (η² = 0.76). The integration of interpretability further enhances reliability by localizing violence-relevant regions in frames. Overall, the proposed explainable framework establishes a robust, efficient, and transparent solution for automated violence detection in diverse real-world scenarios.
Keywords: Violence detection, Explainable deep learning, Attention-enhanced CNN, Keyframe selection, Grad-CAM++ visualization
Subject terms: Engineering, Mathematics and computing
Introduction
The rapid growth of video surveillance systems and social media platforms has resulted in an exponential increase in video data1. This expansion has simultaneously created opportunities and challenges in developing automated systems capable of detecting abnormal or violent behaviors in real-time2. Violence detection in videos is a crucial task across various domains such as public security, sports analysis, law enforcement, and social media moderation3. However, despite significant progress in computer vision and deep learning, accurately detecting violent incidents remains a complex problem due to the high variability and unpredictability of human behavior, as well as the noisy, cluttered, and dynamic nature of real-world videos4.
Traditional video analysis techniques have relied heavily on handcrafted features, such as optical flow, motion histograms, or trajectory analysis, to identify violent actions5. These approaches, however, are limited in their ability to generalize across diverse environments, camera angles, and lighting conditions6. With the advent of deep learning, convolutional neural networks (CNNs) and recurrent architectures such as Long Short-Term Memory (LSTM) networks have achieved remarkable success in extracting spatiotemporal patterns from videos7. Nonetheless, these models often require extensive computational resources and large annotated datasets, which are difficult to obtain in violence detection tasks8. Moreover, their black-box nature poses a challenge to interpretability—an essential requirement in security applications where decision transparency is critical9.
One of the fundamental challenges in violence detection is the redundant information present in video sequences10. Continuous video frames often contain similar visual content, adding computational overhead without improving model performance11. The redundancy problem not only slows down the training process but also increases memory consumption, leading to inefficiencies during inference12. Eliminating such unnecessary frames is therefore a crucial preprocessing step. However, most existing approaches to keyframe extraction rely on complex unsupervised clustering algorithms or optical-flow-based heuristics that are computationally expensive and dataset-dependent13. An intelligent, unsupervised deep learning-based keyframe selection mechanism can significantly reduce video length while retaining essential spatiotemporal information, thus improving both the speed and accuracy of violence detection systems14.
In addition to redundancy, violence detection faces challenges stemming from data quality and imbalance15. Real-world video datasets often suffer from noise, low resolution, and poor contrast—particularly in surveillance settings with low-light or occluded scenes16. These artifacts obscure visual cues crucial for recognizing violent actions such as punching, kicking, or sudden movement bursts17. Furthermore, the scarcity of well-balanced datasets containing diverse violence classes restricts the ability of models to learn generalizable representations18. Datasets like Hockey Fight and Real-Life Violence Situations (RLVS) provide valuable benchmarks, yet they remain limited in scale and diversity compared to real-world conditions19. Surveillance-oriented datasets such as ShanghaiTech and UCF-Crime capture complex environmental variations but exhibit class imbalance, with far more non-violent than violent clips20. This imbalance often biases models toward the majority class, resulting in degraded recall and misclassification of violent events21.
Capturing both spatial and temporal cues effectively remains another open challenge. Spatial features describe the appearance of actors and environments within a frame, while temporal features capture motion dynamics and contextual transitions across frames22. Most conventional CNN-based methods excel at extracting spatial features but lack temporal awareness, while RNN-based architectures capture temporal evolution at the cost of spatial detail23. Hybrid models that combine convolutional layers with attention-based temporal modeling have shown promise in bridging this gap24. Attention mechanisms enable models to focus on the most informative regions or frames within a sequence, mimicking the human cognitive process of selectively attending to salient motion patterns25. By integrating attention modules within CNN architectures, models can better identify critical frames associated with violent events and suppress irrelevant background activities26.
Explainability has recently become a major focus in computer vision research, especially for high-stakes applications such as public safety27. Conventional deep learning models often behave as opaque systems, making it difficult for practitioners to understand the reasoning behind their decisions28. In violence detection, the interpretability of a model’s output is vital for building trust and ensuring accountability29. Explainable AI (XAI) techniques—such as Gradient-weighted Class Activation Mapping (Grad-CAM) and its variants—enable visualization of the regions or frames that contribute most to a model’s decision30. Integrating such explainability mechanisms within violence detection frameworks allows users to validate model predictions, identify potential biases, and gain insight into model attention patterns, which is particularly useful for forensics and evidence verification31.
Given these limitations, there is a growing demand for efficient, interpretable, and robust deep learning solutions that can process large volumes of video data without sacrificing accuracy or transparency32. Recent studies have demonstrated that combining keyframe extraction, attention mechanisms, and explainable CNN architectures can substantially improve performance in video classification tasks33. However, their potential in violence detection, particularly across both surveillance and non-surveillance contexts, remains underexplored. A unified framework that leverages unsupervised keyframe selection to reduce redundancy, employs attention-driven CNN blocks to enhance feature discrimination, and incorporates explainable visualization tools can address these challenges effectively34.
The present study proposes an explainable violence detection model that integrates automatic keyframe selection, attention-based feature extraction, and CNN-based classification for improved interpretability and performance35. The proposed approach is validated on multiple benchmark datasets, including non-surveillance datasets (Real-Life Violence Situation, Hockey Fight, and Violent Flow) and surveillance datasets (ShanghaiTech and UCF-Crime), ensuring comprehensive evaluation under varied real-world scenarios. This framework aims to advance the field of video violence detection by addressing the intertwined challenges of redundancy, interpretability, and generalization. The major contributions of this study are summarized as follows:
A novel unsupervised deep learning–based keyframe extraction technique is introduced to automatically eliminate redundant frames from video sequences, thereby reducing computational cost and accelerating the training and inference processes without compromising essential temporal information.
A new, diversified dataset is developed encompassing multiple violence categories from both surveillance and non-surveillance contexts. This dataset provides balanced class representation and enhanced variability, ensuring robust and generalized model training.
An advanced deep learning architecture integrating attention mechanisms within a CNN framework is proposed to capture discriminative spatial–temporal features of violent actions more effectively. The attention modules guide the model to focus selectively on salient motion regions and frames, improving recognition accuracy.
The proposed system incorporates explainable AI techniques such as Grad-CAM++ to visualize model attention regions, enabling interpretability and transparency of decision-making processes, which are essential for high-trust applications like surveillance and public safety.
The model’s performance is rigorously validated on multiple benchmark datasets—including Real-Life Violence Situation, Hockey Fight, Violent Flow, ShanghaiTech, and UCF-Crime—demonstrating its superior generalization capability and robustness across diverse environments and video domains.
The remainder of this article is organized as follows. Section 2 reviews related studies on video-based violence detection, keyframe extraction, attention mechanisms, and explainable AI approaches. Section 3 details the proposed methodology, including the unsupervised keyframe selection and attention-enhanced CNN framework. Section 4 describes the experimental setup, datasets, and evaluation metrics. Section 5 presents and discusses the results with comparisons to existing methods. Finally, Sect. 6 concludes the paper and outlines future research directions in explainable video violence detection.
Literature review
Recent advances in computer vision and deep learning have substantially improved automated violence detection in video data35. Existing research can be broadly categorized into handcrafted feature-based methods, deep learning–based spatiotemporal models, attention and transformer-based approaches, and explainable and weakly supervised violence detection frameworks. This structured overview highlights methodological trends, strengths, and limitations, and positions the proposed work within current research gaps.
Early violence detection systems primarily relied on handcrafted motion and appearance descriptors such as optical flow, motion histograms, and trajectory-based features36. Dündar et al. (2024) employed a Bag-of-Words framework using STIP and MoSIFT descriptors with SVM classification to distinguish fight and non-fight videos19. While achieving nearly 90% accuracy, such approaches exhibit limited generalization in complex scenes due to sensitivity to viewpoint changes, illumination variations, and handcrafted feature rigidity. These limitations motivated the shift toward deep learning–based representations capable of learning richer spatiotemporal patterns directly from data.
With the rise of deep learning, CNNs and recurrent architectures became dominant in violence recognition. Mahmoodi et al. (2024) proposed a two-stream network combining background-suppressed frames and frame differences with a separable ConvLSTM, achieving improved accuracy on the RWF-2000 dataset37. Verma et al. (2025) fused AlexNet and SqueezeNet features followed by ConvLSTM temporal modeling, reporting high accuracy on trimmed datasets such as Hockey Fight and Violent Flow38. Mahmoud et al. (2025) integrated dense optical flow (GMFlow) with 3D CNNs and CBAM attention, achieving strong performance but at the cost of expensive motion preprocessing39. Although these methods demonstrate high accuracy, many rely on trimmed clips, computationally heavy optical flow, or recurrent architectures that increase inference cost and limit scalability in real-time surveillance environments.
To enhance feature discrimination, attention mechanisms and transformer architectures have been increasingly explored. Soontornnapar et al. (2025) introduced a CNN-CHA-SPA model incorporating channel and spatial attention, improving violence classification accuracy over baseline CNNs40. Alruwaili et al. (2024) evaluated several 3D CNN architectures, including Inception-based variants, reporting very high accuracies but with concerns regarding overfitting and limited validation on unconstrained environments41.
Transformer-based models further advanced spatiotemporal modeling. Rendón-Segador et al. (2023) proposed CrimeNet, combining transformers with adaptive temporal smoothing for anomaly detection42. ViViT and other video transformers demonstrated strong performance but often require substantial computational resources and large-scale training data. As highlighted by Wu et al. (2023), transformer-based and supervised action recognition models excel on trimmed datasets, while anomaly-based approaches handle untrimmed videos better43. These findings reveal a trade-off between accuracy, efficiency, and generalization.
Given the scarcity of fine-grained annotations, weakly supervised anomaly detection has gained prominence, particularly for long surveillance videos. Barbosa et al. (2025) formulated violence detection as a multiple-instance learning problem on UCF-Crime, achieving strong AUC scores using ranking-based loss functions44. Nejad et al. (2023) and Qi et al. (2022) further improved weak supervision by incorporating sparsity and smoothness constraints to enhance temporal localization45,46. Karim et al. (2024) emphasized real-time anomaly detection by reducing latency and computational overhead36. While effective for untrimmed surveillance footage, these approaches often rely on video-level labels, limiting precise localization and interpretability at the frame level.
Explainability has emerged as a critical requirement for deploying violence detection systems in public safety applications. Janani et al. (2024) proposed a domain-specific framework for detecting violence against women, integrating gender recognition with violence classification, but achieved modest accuracy due to limited real-world diversity47. Several studies incorporated Grad-CAM-based visualization or attention maps to improve interpretability, yet often treated explainability as a post-hoc addition rather than an integrated design component.
Multimodal approaches combining audio and visual cues have also been explored. Shin et al. (2024) introduced the large-scale XD-Violence dataset and demonstrated improved detection using audiovisual fusion19. Similarly, a 2024 study on audiovisual fusion for public-place violence detection showed that combining audio and visual streams improves robustness under challenging conditions47. However, such systems remain sensitive to ambient noise and do not explicitly address frame redundancy or explainability.
Prior studies have made substantial progress in violence detection using handcrafted features, deep spatiotemporal networks, attention mechanisms, transformers, and weak supervision. Nevertheless, existing methods often suffer from high computational cost, frame redundancy, limited interpretability, or poor generalization across surveillance and non-surveillance domains. Moreover, explainability is frequently treated as an auxiliary visualization rather than an integral part of the detection pipeline. To address these gaps, the present work proposes an efficient and explainable framework that combines unsupervised keyframe selection, attention-enhanced CNN learning, and Grad-CAM++-based interpretability, enabling robust violence detection with reduced redundancy and transparent decision-making across diverse real-world video scenarios.
Methodology
This section outlines the methodological framework employed to develop the proposed explainable violence detection model. The overall design integrates three key components: unsupervised keyframe selection, attention-enhanced CNN-based classification, and explainable AI visualization. The methodology is structured to address the challenges identified in earlier sections, particularly frame redundancy, limited interpretability, and imbalance between spatial and temporal information. The process begins with preprocessing and unsupervised extraction of keyframes from input video sequences to reduce computational redundancy while retaining essential visual cues. These selected keyframes are then passed through an attention-driven convolutional neural network designed to capture discriminative spatial and temporal features associated with violent actions. Finally, explainable AI techniques such as Grad-CAM++ are applied to visualize the decision-making process of the model, enhancing interpretability and transparency. Each component of the framework is described in detail in the subsequent subsections, along with model architecture, dataset selection, training configurations, and evaluation metrics.
Overview of the proposed framework
The proposed framework introduces an explainable and efficient deep learning–based pipeline for video violence detection. The overall process, illustrated in Fig. 1, comprises four sequential stages: video preprocessing, unsupervised keyframe selection, attention-enhanced CNN-based classification, and explainability visualization. This modular design ensures that the model effectively handles redundant video frames, learns discriminative spatiotemporal features, and provides visual explanations for its predictions.
Fig. 1.
Workflow of the proposed explainable video violence detection framework integrating keyframe selection, attention-based CNN, and Grad-CAM++ visualization.
In the first stage, input videos are divided into frame sequences and preprocessed through normalization, resizing, and color-space standardization to ensure consistent visual quality across datasets. The second stage involves unsupervised keyframe selection, which automatically extracts the most representative frames by identifying substantial changes in visual or motion characteristics. This significantly reduces redundant information while retaining crucial temporal dynamics relevant to violent events.
The selected keyframes are subsequently passed into the attention-enhanced convolutional neural network, which serves as the primary feature extraction and classification engine. The embedded attention mechanism enables the model to focus selectively on spatial regions or temporal moments that are most indicative of violent activity. This selective weighting enhances discriminative capability while mitigating noise from irrelevant background motion or lighting variations.
Finally, to ensure interpretability, explainable AI visualization using Grad-CAM++ is applied to the network’s final convolutional layers. This step highlights the image regions contributing most strongly to the classification decision, providing human-understandable visual evidence of the model’s reasoning. Together, these components create a unified, transparent framework capable of efficient and interpretable violence detection across both surveillance and non-surveillance video datasets.
Unsupervised keyframe selection module
In video-based violence detection, redundant frames introduce unnecessary computational overhead and can obscure critical motion cues, reducing both training efficiency and model generalization. To address this issue, the proposed framework employs an unsupervised keyframe selection module designed to automatically identify and retain the most informative frames from a video sequence without requiring manual annotation. This process ensures that only the representative frames—those containing meaningful spatial and temporal variations—are forwarded to the classification network.
The proposed module operates in three main stages: feature extraction, frame similarity computation, and representative frame selection. Initially, visual features are extracted from each frame using a pretrained CNN backbone (e.g., ResNet18) to form a compact feature embedding.
Let $f_i$ denote the feature vector of the $i$-th frame in a sequence of $N$ frames. Frame similarity is computed using a pairwise cosine distance metric:

$$D_{ij} = 1 - \frac{f_i \cdot f_j}{\lVert f_i \rVert \, \lVert f_j \rVert},$$

where $D_{ij}$ represents the dissimilarity between frames $i$ and $j$. Frames with high dissimilarity values are more likely to contain unique visual information. An unsupervised clustering approach, such as K-means, is applied to group similar frames based on their feature embeddings. The centroid of each cluster represents a keyframe, which preserves the distinct motion transitions across the sequence.
This unsupervised selection process reduces redundancy while maintaining temporal consistency. The number of clusters K is adaptively determined using the elbow criterion, ensuring an optimal balance between representativeness and computational cost. The selected keyframes are sequentially ordered before being passed into the attention-enhanced CNN, forming a compressed yet information-rich video representation.
A key concern in keyframe-based video analysis is the potential loss of short-duration micro-actions, such as rapid limb movements or brief physical interactions, which are critical for violence recognition. To address this, the proposed keyframe selection operates in a learned deep feature space rather than relying on uniform temporal sampling or raw pixel similarity. Frame-level embeddings extracted using a pretrained CNN encode fine-grained motion and appearance cues; consequently, frames exhibiting subtle but semantically meaningful variations yield higher cosine dissimilarity values and are preserved during clustering. This feature-aware strategy ensures that discriminative micro-actions are retained while redundant frames are suppressed.
K-means clustering was selected due to its computational efficiency, deterministic centroid representation, and scalability when applied to high-dimensional CNN feature embeddings. In contrast to more complex density-based or graph-based clustering methods, K-means provides stable and reproducible keyframe selection with linear complexity, making it suitable for large-scale and near real-time video processing. The number of clusters is adaptively determined using the elbow criterion to balance redundancy reduction and information preservation without dataset-specific tuning.
A pseudocode representation of the keyframe selection process is provided below:
Algorithm 1.
Unsupervised Keyframe Selection
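As a complement to Algorithm 1, a minimal Python sketch of the same procedure is given below. It assumes a torchvision ResNet18 feature extractor and scikit-learn's KMeans, and the simple inertia-based elbow heuristic shown here is an illustrative stand-in for the adaptive cluster-count rule described above, not the exact implementation.

```python
# Sketch of unsupervised keyframe selection: CNN embeddings -> cosine
# dissimilarity -> K-means -> temporally ordered cluster representatives.
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((256, 256)),
    transforms.ToTensor(),                     # scales pixels to [0, 1]
])

def embed_frames(frames, device="cuda" if torch.cuda.is_available() else "cpu"):
    """Embed RGB frames (H x W x 3 uint8 arrays) with a pretrained ResNet18."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()          # keep the 512-d pooled feature
    backbone.eval().to(device)
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        return backbone(batch).cpu().numpy()   # shape (N, 512)

def select_keyframes(frames, k_max=12):
    feats = embed_frames(frames)
    # Pairwise cosine dissimilarity D_ij (the equation above); returned so the
    # redundancy structure of the clip can be inspected.
    dissim = cosine_distances(feats)
    # Crude elbow heuristic: stop when the inertia improvement drops below 10%.
    prev_inertia, k_best = None, 2
    for k in range(2, min(k_max, len(frames)) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
        if prev_inertia is not None and (prev_inertia - km.inertia_) / prev_inertia < 0.10:
            break
        prev_inertia, k_best = km.inertia_, k
    km = KMeans(n_clusters=k_best, n_init=10, random_state=0).fit(feats)
    # Keyframe = frame closest to each cluster centroid, restored to temporal order.
    key_idx = sorted(
        int(np.argmin(np.linalg.norm(feats - c, axis=1))) for c in km.cluster_centers_
    )
    return key_idx, dissim
```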
Across the evaluated datasets, the proposed module retained approximately 32–36% of frames per video, corresponding to an average of 55–90 keyframes per clip, depending on video duration and content variability (Table 1). Since keyframe selection is performed in a deep feature space rather than raw pixel space, the method remains relatively robust to low-light, blur, and moderate noise. Although extreme visual degradation can reduce feature separability, the combined use of data augmentation and attention-guided feature refinement helps mitigate performance loss by emphasizing relative motion and interaction cues.
Table 1.
Descriptive statistics of datasets and keyframe ratios.
| Dataset | Total videos | Violent/non-violent (%) | Mean frames per clip (µ) | Std. dev. (σ) | Keyframe retention (%) |
|---|---|---|---|---|---|
| RLVS | 2,000 | 51/49 | 185 | 42 | 34.6 |
| Hockey fight | 1,000 | 50/50 | 150 | 38 | 36.1 |
| Violent flow | 246 | 53/47 | 170 | 44 | 33.9 |
| ShanghaiTech | 437 | 38/62 | 210 | 55 | 31.7 |
| UCF-crime | 1,900 | 41/59 | 260 | 62 | 32.4 |
Unlike uniform temporal sampling, which selects frames at fixed intervals without considering visual content, the proposed unsupervised keyframe selection is content-aware and operates in a learned CNN feature space. This enables the preservation of short-duration but semantically important motion events while suppressing visually redundant frames. From a computational perspective, the keyframe module introduces minimal overhead, as feature extraction is performed once per video and clustering is applied to low-dimensional embeddings. In practice, the keyframe selection stage contributes less than 8% to the total inference time, while reducing the number of frames processed by the classification network by approximately 65%. This results in a net reduction in overall latency and a substantial improvement in inference throughput, as reflected by the FPS gains reported in the experimental results.
As shown in Fig. 2, keyframes are extracted by first embedding video frames using a CNN backbone and then computing pairwise similarity (cosine distance) in the learned feature space. The resulting embeddings are clustered to select representative frames, which are subsequently filtered and temporally ordered to form the final keyframe sequence for downstream classification.
Fig. 2.
Keyframe extraction process from video input through CNN feature extraction, similarity computation, clustering, and temporal keyframe selection.
Attention-enhanced CNN architecture
The proposed classification network leverages an attention-enhanced convolutional neural architecture to extract discriminative spatial–temporal features from the selected keyframes. This module forms the core of the violence detection framework, bridging spatial representation learning and motion-aware reasoning. Unlike conventional CNNs that treat all feature regions equally, the attention mechanism adaptively re-weights spatial and channel responses, guiding the network to focus on regions with the highest contextual relevance to violent actions.
The architectural components of the proposed framework were selected to balance accuracy, computational efficiency, and interpretability. ResNet18 was adopted as the backbone due to its lightweight structure, stable training behavior, and strong representational capability for motion-centric visual tasks, making it suitable for near real-time deployment. Deeper or more recent backbones introduce higher latency and memory overhead with marginal performance gains under keyframe-based processing.
The attention module integrates channel and spatial attention to enhance feature discrimination while maintaining low computational complexity. Channel attention emphasizes motion-relevant feature maps, whereas spatial attention focuses on dynamic interaction regions. This lightweight attention design provides effective refinement of spatiotemporal cues without relying on computationally expensive transformer-based self-attention, aligning with the efficiency and explainability goals of the proposed framework.
The base of the proposed model employs a ResNet-18 backbone, chosen for its balance between computational efficiency and representational power. The architecture begins with a convolutional stem comprising a 7 × 7 convolution layer, batch normalization, and ReLU activation, followed by a max-pooling layer to reduce spatial resolution. The subsequent residual blocks progressively extract hierarchical features across four stages, denoted as B₁–B₄, each doubling the channel depth while halving spatial dimensions. The residual connections within each block enable stable gradient propagation and preserve fine-grained spatial details crucial for identifying subtle motion cues.
To improve feature discrimination, dual-branch attention mechanisms—spatial and channel attention—are integrated within each residual stage.
The channel attention module (CAM) recalibrates inter-channel dependencies by applying global average pooling and a multi-layer perceptron to generate per-channel weights. Formally, for a feature map $F \in \mathbb{R}^{C \times H \times W}$, the channel attention response is computed as:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{GAP}(F))\big),$$

where $\sigma$ denotes the sigmoid activation, and $M_c(F) \in \mathbb{R}^{C \times 1 \times 1}$ assigns adaptive weights to each channel.

The spatial attention module (SAM) emphasizes salient spatial regions by applying convolutional filtering over concatenated average- and max-pooled feature maps:

$$M_s(F) = \sigma\big(f^{7 \times 7}\big([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)]\big)\big).$$

The final refined feature map is obtained by sequentially applying channel and spatial attention:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes element-wise multiplication.
The attention-enhanced feature maps from the final stage are flattened and passed through fully connected layers for classification into violent and non-violent categories.
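A minimal PyTorch sketch of this CAM/SAM refinement is shown below. The reduction ratio, the 7 × 7 spatial kernel, and the module names are illustrative assumptions in the spirit of CBAM-style attention rather than the exact implementation.

```python
# CBAM-style channel (CAM) and spatial (SAM) attention applied sequentially
# to a residual-stage feature map F, producing F'' = M_s(F') ⊗ F' with F' = M_c(F) ⊗ F.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))                # global average pooling -> (B, C)
        w = torch.sigmoid(self.mlp(gap))        # per-channel weights M_c(F)
        return x * w.unsqueeze(-1).unsqueeze(-1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # channel-wise average pooling
        mx, _ = x.max(dim=1, keepdim=True)      # channel-wise max pooling
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial map M_s
        return x * w

class AttentionBlock(nn.Module):
    """Wraps a residual block with sequential CAM -> SAM refinement."""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, x):
        return self.sam(self.cam(self.block(x)))
```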
Although the proposed attention mechanism is spatial and channel-wise, temporal information is captured implicitly through the use of temporally ordered keyframes selected based on feature-level dissimilarity. These keyframes preserve motion transitions across time, while channel and spatial attention emphasize motion-sensitive feature maps and dynamic interaction regions. This combination enables effective learning of temporal cues without employing explicit recurrent or transformer-based temporal modeling. Temporal coherence across consecutive keyframes is further preserved by the spatial attention mechanism and global average pooling, allowing the network to distinguish subtle variations in human motion. The resulting feature embeddings are subsequently used for Grad-CAM++ explainability visualization, linking high-level representations to interpretable visual cues.
As illustrated in Fig. 3, the selected keyframes are processed by a ResNet-style backbone composed of stacked residual blocks, where channel attention (CAM) and spatial attention (SAM) are inserted to refine discriminative feature responses. The resulting feature maps are aggregated using global average pooling (GAP) and passed to a fully connected (FC) layer to produce the final violence/non-violence prediction. As summarized in Table 2, the attention-enhanced CNN processes preprocessed keyframes through an initial convolutional stem followed by four residual stages B₁–B₄ with integrated CAM and SAM modules to progressively refine discriminative features. The final representation is aggregated using global average pooling (GAP) and mapped by a fully connected layer to produce the binary violence/non-violence logits.
Fig. 3.
Architecture of the attention-enhanced CNN with keyframe input, convolutional and residual blocks, CAM and SAM attention mechanisms, followed by GAP and fully connected layers.
Table 2.
Layer-wise description of the attention-enhanced CNN.
| Stage | Operation | Output dimension | Description |
|---|---|---|---|
| Input | Keyframe sequence | 3 × 256 × 256 | RGB input after preprocessing |
| Conv1 | 7 × 7 Conv + BN + ReLU + MaxPool | 64 × 64 × 64 | Low-level spatial features |
| Block B₁ | Residual block + CAM + SAM | 64 × 64 × 64 | Shallow attention feature refinement |
| Block B₂ | Residual block + CAM + SAM | 128 × 32 × 32 | Enhanced mid-level representations |
| Block B₃ | Residual block + CAM + SAM | 256 × 16 × 16 | Deep feature extraction |
| Block B₄ | Residual block + CAM + SAM | 512 × 8 × 8 | High-level semantic abstraction |
| GAP + FC | Global average pooling + fully connected | 1 × 1 × 512 → 2 | Classification logits (violent/non-violent) |
Explainable AI integration (Grad-CAM++)
The integration of explainable artificial intelligence (XAI) within the proposed framework serves to enhance the transparency, interpretability, and forensic reliability of model decisions. Given that deep convolutional networks operate as high-dimensional, nonlinear systems, understanding how specific input regions contribute to the prediction outcome is crucial—particularly in safety-critical applications such as violence detection. To achieve this interpretability, the Gradient-weighted Class Activation Mapping++ (Grad-CAM++) method is incorporated into the architecture to visualize salient regions influencing classification outcomes.
Grad-CAM++ extends the conventional Grad-CAM approach by introducing higher-order derivative terms that better capture the contribution of multiple spatial locations, especially in cases where multiple objects or actions overlap. For a given input image and class $c$, Grad-CAM++ computes a class-discriminative localization map $L^{c}$ as follows:

$$L_{ij}^{c} = \mathrm{ReLU}\!\left(\sum_{k} w_{k}^{c}\, A_{ij}^{k}\right), \qquad w_{k}^{c} = \sum_{i}\sum_{j} \alpha_{ij}^{kc}\, \mathrm{ReLU}\!\left(\frac{\partial Y^{c}}{\partial A_{ij}^{k}}\right),$$

where $A^{k}$ represents the $k$-th feature map of the final convolutional layer, $Y^{c}$ is the class score, and $w_{k}^{c}$ denotes the weight corresponding to class $c$, derived from the pixel-wise gradients of $Y^{c}$ with respect to $A^{k}$ together with the higher-order coefficients $\alpha_{ij}^{kc}$. These weights effectively aggregate spatial importance and are modulated through a ReLU activation to retain only positive contributions, ensuring that the heatmap reflects the regions most supportive of the predicted class. The resulting map $L^{c}$ is upsampled to the original image resolution and superimposed on the input frame to produce a heatmap visualization, highlighting areas that significantly influenced the model’s decision.
Within the proposed system, Grad-CAM++ visualizations are generated after the final convolutional stage of the attention-enhanced CNN. The heatmaps consistently focus on body regions, rapid limb motion, and inter-personal interaction areas indicative of violent actions, while background or non-relevant regions are suppressed. These visual explanations serve three essential purposes:
Transparency: They enable human observers to understand the reasoning behind each prediction.
Bias detection: They help identify potential dataset or model biases by revealing attention on irrelevant features.
Forensic validation: They support the use of AI-driven evidence in real-world monitoring systems by visually linking decision outcomes to interpretable cues.
By combining Grad-CAM++ with the attention mechanism, the proposed framework not only improves detection accuracy but also provides explainable visual justifications, bridging the gap between model interpretability and operational reliability.
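To make this procedure concrete, a simplified PyTorch sketch of Grad-CAM++ heatmap generation is given below. It relies on standard forward/backward hooks and the usual Grad-CAM++ weighting; the function name, hook handling, and layer selection are assumptions rather than the exact implementation used here.

```python
# Simplified Grad-CAM++: hooks capture the final conv activations A^k and the
# gradients dY^c/dA^k, from which alpha coefficients, weights w_k^c, and the
# class-discriminative map L^c are computed and upsampled to the input size.
import torch
import torch.nn.functional as F

def grad_cam_pp(model, target_layer, image, class_idx=None):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(image)                          # image: (1, 3, 256, 256)
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1))
        model.zero_grad()
        logits[0, class_idx].backward()

        A, dY = acts["a"], grads["g"]                  # each of shape (1, K, h, w)
        # Alpha coefficients approximated from powers of the first-order
        # gradients, following the original Grad-CAM++ formulation.
        grads2, grads3 = dY ** 2, dY ** 3
        denom = 2.0 * grads2 + (A * grads3).sum(dim=(2, 3), keepdim=True)
        alpha = grads2 / torch.where(denom != 0, denom, torch.ones_like(denom))
        weights = (alpha * F.relu(dY)).sum(dim=(2, 3), keepdim=True)   # w_k^c
        cam = F.relu((weights * A).sum(dim=1, keepdim=True))           # L^c
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # scale to [0, 1]
        return cam.squeeze().detach().cpu()
    finally:
        h1.remove()
        h2.remove()
```

For the proposed architecture, `target_layer` would correspond to the final residual stage of the backbone (for example, `layer4` of a torchvision ResNet-18), an assumption since the exact module naming is not specified here.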
As shown in Fig. 4, Grad-CAM++ generates class-discriminative heatmaps that localize the regions most responsible for the model’s prediction in both violent and non-violent frames. The highlighted activations concentrate on interaction and motion-relevant areas in violent scenes, providing interpretable visual evidence of the network’s decision process.
Fig. 4.
Visual frames with corresponding Grad-CAM++ heatmaps for violent and non-violent scenes, highlighting key motion regions for violence detection.
Dataset description and preprocessing
The proposed explainable violence detection framework was evaluated on a collection of diverse benchmark datasets encompassing both surveillance and non-surveillance contexts. This combination ensures a comprehensive assessment of the model’s generalization ability across different recording environments, resolutions, lighting conditions, and action dynamics. The selected datasets—Real-Life Violence Situations (RLVS), Hockey Fight, Violent Flow, ShanghaiTech, and UCF-Crime—represent the most widely adopted benchmarks for video-based violence detection research.
The non-surveillance datasets include:
Real-life violence situations (RLVS): A large-scale dataset comprising over 2,000 clips sourced from real-world scenarios such as street fights and public altercations. It exhibits diverse lighting, camera angles, and crowd densities, providing realistic conditions for general-purpose violence recognition.
Hockey fight dataset: Contains 1,000 short video clips from ice hockey matches, evenly divided between violent and non-violent scenes. It is characterized by high motion intensity and player interactions, making it suitable for evaluating motion-centric models.
Violent flow dataset: Includes 246 video sequences with dynamic camera movements and varying resolutions. It emphasizes motion flow and optical intensity, particularly suitable for evaluating spatiotemporal feature learning models.
The surveillance datasets include:
ShanghaiTech campus dataset: Captures surveillance footage of normal and abnormal (violent) activities in outdoor and indoor environments. Videos exhibit high variability in object density, occlusions, and environmental lighting, providing a challenging setting for model robustness.
UCF-crime dataset: A large-scale real-world surveillance dataset containing over 1,900 long-duration videos spanning 13 anomaly categories, including violence and assault. It offers extensive variability in camera positioning, scene complexity, and scale, testing the model’s generalizability under practical surveillance conditions.
Preprocessing and augmentation pipeline
All videos were uniformly processed to ensure compatibility across datasets. Frame sequences were extracted at a fixed rate of 25 frames per second (fps). Each frame was resized to 256 × 256 pixels and normalized to a range of [0, 1]. The preprocessing pipeline further included:
Color normalization to correct lighting disparities between day and night scenes.
Data augmentation techniques such as random horizontal flipping, rotation (± 15°), brightness and contrast adjustment, and Gaussian blur to simulate real-world variability and mitigate overfitting.
Frame sampling and keyframe selection using the unsupervised module described in Sect. 3.2 to reduce redundancy and ensure temporal consistency.
Given the inherent class imbalance present in datasets such as ShanghaiTech and UCF-Crime, dataset balancing strategies were employed. Oversampling of minority (violent) classes and selective undersampling of majority (non-violent) classes were applied to maintain a near-uniform class distribution. Additionally, class-weighted loss functions were integrated into the training process to further counter bias during optimization.
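The preprocessing and balancing steps above can be expressed concisely in PyTorch. The sketch below shows a torchvision augmentation pipeline and a class-weighted cross-entropy loss of the kind used to counter the surveillance-set imbalance; the augmentation probabilities, jitter magnitudes, and class counts are illustrative assumptions.

```python
# Train-time frame preprocessing/augmentation and a class-weighted loss.
# Parameter values mirror the text (256x256, ±15° rotation, brightness/contrast
# adjustment, Gaussian blur); exact probabilities and counts are assumptions.
import torch
import torch.nn as nn
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),                     # scales pixels to [0, 1]
])

def weighted_ce(class_counts):
    """Cross-entropy with weights inversely proportional to class frequency."""
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)
    return nn.CrossEntropyLoss(weight=weights)

# Hypothetical non-violent/violent clip counts for an imbalanced surveillance set.
criterion = weighted_ce([1178, 722])
```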
This multi-dataset training setup ensures the robustness of the proposed attention-enhanced CNN model across both structured sports scenes and complex, unconstrained surveillance environments.
As summarized in Fig. 5, the evaluation uses both non-surveillance and surveillance datasets, ensuring diversity in recording conditions and scene dynamics. The preprocessing pipeline standardizes videos through frame extraction (25 fps), resizing to 256 × 256, normalization, and augmentation, followed by class balancing and a train/validation/test split for robust model training. As reported in Table 3, the evaluation spans five widely used violence/anomaly video benchmarks covering both non-surveillance (RLVS, Hockey Fight, Violent Flow) and surveillance (ShanghaiTech, UCF-Crime) environments. This diversity in recording settings, frame rates, and resolutions enables a robust assessment of the proposed model’s generalization across real-world conditions.
Fig. 5.
Overview of dataset distribution and preprocessing pipeline showing non-surveillance and surveillance data sources with sequential steps for frame extraction, resizing, normalization, augmentation, and class balancing.
Table 3.
Summary of datasets used for evaluation.
| Dataset | Category | No. of Videos | Environment type | Frame rate (fps) | Resolution | Remarks |
|---|---|---|---|---|---|---|
| RLVS | Non-surveillance | 2,000+ | Real-world street and public scenes | 25 | Variable (480p–720p) | High diversity and real-life dynamics |
| Hockey fight | Non-surveillance | 1,000 | Sports broadcast | 30 | 360p–480p | Controlled lighting, fast action |
| Violent flow | Non-surveillance | 246 | Mixed indoor/outdoor | 24 | 320p–720p | Emphasizes motion flow and optical intensity |
| ShanghaiTech | Surveillance | 437 | Campus CCTV | 15 | 480p | Contains occlusions and low-light scenes |
| UCF-crime | Surveillance | 1,900+ | Public security CCTV | 30 | Variable (480p–1080p) | Large-scale dataset with multiple anomaly types |
Training configuration and hyperparameter settings
The proposed explainable violence detection framework was trained using a carefully optimized configuration to ensure both model convergence stability and computational efficiency. All experiments were conducted using the PyTorch deep learning framework (v2.1), with CUDA v12.1 and cuDNN v8.9 support, running on a system equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), an Intel Core i9-13900K CPU, and 64 GB of RAM under a Windows 11 (64-bit) operating environment.
Training setup
The network parameters were initialized using He Normal Initialization, which is well-suited for ReLU-based architectures. The AdamW optimizer was employed to combine the benefits of Adam’s adaptive learning rate mechanism and weight decay regularization for stable convergence. The initial learning rate $\eta_0$ was set to 0.0001, with a cosine annealing scheduler applied to dynamically reduce the learning rate as training progressed. The learning rate update rule can be expressed as:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_0 - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{t\pi}{T}\right)\right),$$

where $t$ denotes the current epoch and $T$ the total number of epochs.
A batch size of 16 was chosen to balance GPU utilization and gradient stability, while the model was trained for 100 epochs with early stopping triggered if the validation loss did not improve for 10 consecutive epochs. A weight decay of 5 × 10⁻⁴ was applied to prevent overfitting, and gradient clipping (max norm = 5.0) was used to avoid gradient explosion during backpropagation.
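A condensed sketch of this optimization setup (AdamW, cosine annealing, gradient clipping, and FP16 mixed precision) is shown below; `model`, `train_loader`, and `criterion` are placeholders for the network, data loader, and loss defined elsewhere.

```python
# Training-loop skeleton reflecting the reported configuration:
# AdamW (lr=1e-4, weight decay 5e-4), cosine annealing over 100 epochs,
# gradient clipping at max-norm 5.0, and FP16 mixed precision.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(100):
    model.train()
    for frames, labels in train_loader:        # batch size 16
        frames, labels = frames.cuda(), labels.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():        # FP16 forward pass
            loss = criterion(model(frames), labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)             # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()
    # Early stopping on validation loss (patience = 10 epochs) omitted for brevity.
```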
Data splitting and evaluation
All datasets underwent a consistent 70:15:15 split into training, validation, and testing subsets, ensuring that no video sequence overlapped across partitions. To enhance generalization, stratified sampling was employed to preserve class proportions between violent and non-violent instances. During each epoch, training loss and validation accuracy were monitored simultaneously, with the model checkpoint corresponding to the lowest validation loss saved as the final deployable model.
The training process incorporated mixed precision (FP16) to improve memory efficiency and speed, while ensuring numerical stability. Each model variant (with and without attention) was trained under identical configurations to enable fair performance comparisons. As illustrated in Fig. 6, the model is trained using a 70/15/15 train–validation–test split, with optimization performed via AdamW under a learning-rate scheduling strategy. Validation performance is monitored for early stopping, and the best checkpoint is selected for final evaluation on the held-out test set. As summarized in Table 4, all models were trained under a consistent and reproducible configuration using PyTorch with GPU acceleration, optimizing with AdamW and a cosine-annealed learning rate schedule. The table details key hyperparameters (batch size, epochs/early stopping, weight decay, gradient clipping) and the 70/15/15 data split, ensuring fair comparisons across model variants.
Fig. 6.
Workflow diagram illustrating the model training and evaluation pipeline, including data splitting, optimizer configuration, learning rate scheduling, and early stopping strategy.
Table 4.
Training configuration and hyperparameter settings.
| Parameter | Setting/value | Description |
|---|---|---|
| Framework | PyTorch v2.1 (CUDA 12.1, cuDNN 8.9) | Deep learning environment |
| Optimizer | AdamW | Adaptive gradient optimizer with decoupled weight decay |
| Learning rate (η₀) | 0.0001 (cosine annealing) | Dynamic rate adjustment for convergence |
| Batch size | 16 | Balanced for GPU efficiency |
| Epochs | 100 (early stop = 10) | Training termination policy |
| Weight decay | 5 × 10⁻⁴ | Regularization to reduce overfitting |
| Initialization | He Normal | Suitable for ReLU activations |
| Scheduler | Cosine annealing | Smooth decay of learning rate |
| Gradient clipping | 5.0 | Prevents exploding gradients |
| Data split | 70% train, 15% validation, 15% test | Ensures balanced and non-overlapping subsets |
| Mixed precision | Enabled (FP16) | Enhances computational efficiency |
| Hardware | NVIDIA RTX 4090 (24GB), Intel i9-13900K, 64GB RAM | Training platform |
Evaluation metrics
To objectively assess the performance, robustness, and efficiency of the proposed explainable violence detection framework, a comprehensive set of quantitative and computational metrics was employed. These metrics evaluate both classification performance and computational efficiency across all benchmark datasets, enabling a fair comparison with existing state-of-the-art models.
Quantitative performance metrics
The classification capability of the model was evaluated using five standard performance metrics—Accuracy, Precision, Recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC). These metrics collectively measure the model’s ability to correctly distinguish between violent and non-violent events while maintaining balanced sensitivity and specificity.
Let $TP$, $TN$, $FP$, and $FN$ represent the number of true positives, true negatives, false positives, and false negatives, respectively. The performance metrics are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\; d(\mathrm{FPR})$$
Accuracy reflects the overall correctness of the model’s predictions, while precision and recall capture the balance between positive detection reliability and completeness. The F1-score serves as a harmonic mean between precision and recall, particularly useful under class-imbalanced conditions. AUC quantifies the overall discriminative ability of the model across varying decision thresholds.
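These metrics can be computed directly from the model’s test-set predictions, for example with scikit-learn, as in the short sketch below (label conventions are assumptions).

```python
# Computing Accuracy, Precision, Recall, F1-score, and AUC from test predictions.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def classification_report(y_true, y_pred, y_score):
    """y_true/y_pred: 0 = non-violent, 1 = violent; y_score: predicted P(violent)."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),
    }
```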
Efficiency metrics
To evaluate the computational performance and real-time applicability of the model, additional efficiency metrics were considered, including:
Frames per Second (FPS): Measures inference speed, representing the number of frames processed per second.
Inference Time: Average time (in milliseconds) required to process one frame during testing.
Memory Utilization: Peak GPU memory consumption recorded during inference.
These metrics provide valuable insights into the practical deployment feasibility of the model in surveillance or embedded video analytics systems.
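As an illustration of how such throughput figures can be collected, the sketch below times per-frame inference with CUDA events and reports FPS and peak GPU memory; it is a generic measurement template under assumed inputs, not the exact benchmarking protocol used in this study.

```python
# Measuring per-frame latency, FPS, and peak GPU memory for a trained model.
import torch

@torch.no_grad()
def benchmark(model, frames, device="cuda"):
    model.eval().to(device)
    torch.cuda.reset_peak_memory_stats(device)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for f in frames[:10]:                      # warm-up to exclude CUDA init cost
        model(f.unsqueeze(0).to(device))
    start.record()
    for f in frames:
        model(f.unsqueeze(0).to(device))
    end.record()
    torch.cuda.synchronize()
    ms_per_frame = start.elapsed_time(end) / len(frames)
    return {
        "ms_per_frame": ms_per_frame,
        "fps": 1000.0 / ms_per_frame,
        "peak_mem_gb": torch.cuda.max_memory_allocated(device) / 1024 ** 3,
    }
```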
Robustness and generalization assessment
Model robustness and generalization were evaluated through cross-dataset testing and multi-domain validation. Specifically, the model trained on non-surveillance datasets (e.g., RLVS, Hockey Fight) was evaluated on surveillance datasets (e.g., ShanghaiTech, UCF-Crime) and vice versa. This cross-context evaluation demonstrated the model’s ability to maintain consistent accuracy and recall across varying environments, illumination conditions, and camera angles. Furthermore, the explainability layer (Grad-CAM++) verified that the model’s attention remained focused on meaningful motion and interaction regions, even under noisy or occluded scenarios.
As shown in Fig. 7, model performance is evaluated using standard quantitative metrics (precision, recall, F1-score, AUC) computed on the test set. In addition, deployment feasibility is assessed through efficiency indicators such as FPS and inference time, alongside ROC-based analysis of discriminative capability.
Fig. 7.
Evaluation framework illustrating quantitative and efficiency metrics used to assess the proposed model’s performance.
Experimental results
This section presents the experimental evaluation of the proposed explainable violence detection framework, highlighting its performance across multiple benchmark datasets under both surveillance and non-surveillance conditions. The experiments were designed to assess the model’s effectiveness in accurately detecting violent actions, its interpretability through Grad-CAM++ visualizations, and its computational efficiency during inference. Results are analyzed quantitatively using standard classification metrics and qualitatively through visual evidence, demonstrating the robustness, generalization capability, and explainability of the proposed attention-enhanced CNN architecture.
Descriptive statistical analysis
To ensure the reliability and representativeness of the evaluation process, a descriptive statistical analysis was conducted on all datasets prior to model training. The analysis focused on quantifying the class distribution, average frame counts per clip, and the ratio of retained keyframes following unsupervised selection. These statistics provide insight into the diversity and balance of the datasets, directly influencing model generalization and convergence behavior.
Each dataset was analyzed independently to capture variations in video length, resolution, and the proportion of violent versus non-violent sequences. The mean (µ), standard deviation (σ), and variance (σ²) were computed for both frame counts and keyframe ratios across all datasets. Non-surveillance datasets such as RLVS and Hockey Fight exhibited nearly balanced class proportions, whereas surveillance datasets (ShanghaiTech and UCF-Crime) showed mild imbalance toward non-violent classes, consistent with real-world surveillance scenarios.
The unsupervised keyframe extraction process (Sect. 3.2) reduced the average frame volume by approximately 65%, retaining only the most informative segments. This not only decreased computational complexity but also maintained temporal consistency essential for motion-based violence recognition. Figure 8 provides a visual summary of frame and keyframe distributions, confirming dataset stability and the absence of severe outliers. As summarized in Table 1, the benchmark datasets differ in clip length and class balance, with non-surveillance datasets being nearly balanced while surveillance datasets show a mild skew toward non-violent samples. The unsupervised keyframe selection module consistently retains only ~ 31.7–36.1% of frames across datasets, reducing redundancy while preserving representative temporal content for downstream classification.
Fig. 8.
Statistical summary showing mean frame counts, standard deviations, and keyframe retention ratios across all benchmark datasets used in the study.
Quantitative evaluation metrics
The quantitative evaluation of the proposed explainable violence detection framework was performed across five benchmark datasets: Real-Life Violence Situations (RLVS), Hockey Fight, Violent Flow, ShanghaiTech, and UCF-Crime. The assessment focused on core classification metrics—Accuracy, Precision, Recall, F1-score, and Area Under the Receiver Operating Characteristic Curve (AUC)—to comprehensively capture the model’s discriminative capability.
The proposed attention-enhanced CNN demonstrated superior classification performance compared to traditional CNN and RNN-based architectures. Across all datasets, the model consistently achieved high precision and recall, confirming its robustness in identifying violent scenes while minimizing false detections. In particular, attention-guided feature extraction enabled the model to focus on semantically relevant motion regions, improving sensitivity to subtle physical interactions that typically characterize violent events.
For the non-surveillance datasets (RLVS, Hockey Fight, Violent Flow), the proposed model achieved an average accuracy exceeding 96%, outperforming existing baselines such as 3D-CNN, I3D, and ResNet-LSTM. On the surveillance datasets (ShanghaiTech and UCF-Crime), which contain more challenging conditions like occlusions, low lighting, and camera motion, the model maintained strong generalization with an average accuracy above 92%. This demonstrates the framework’s adaptability across both structured and unstructured environments.
Confusion matrices were analyzed to further validate classification reliability. The results indicate high true positive rates for violent sequences and low false positive rates for non-violent classes. These findings substantiate that the integration of the attention module and unsupervised keyframe extraction effectively enhances both recognition precision and temporal focus.
As reported in Table 5, the proposed framework achieves consistently strong performance across all five benchmarks, with particularly high accuracy and F1-score on non-surveillance datasets and robust results under challenging surveillance conditions. Overall, it attains an average accuracy of 94.6% and F1-score of 93.9%, alongside high AUC values, indicating reliable discrimination between violent and non-violent events.
Table 5.
Quantitative performance comparison across benchmark datasets.
| Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | AUC |
|---|---|---|---|---|---|
| RLVS | 97.3 | 96.8 | 97.9 | 97.3 | 0.982 |
| Hockey fight | 96.5 | 95.9 | 96.2 | 96.0 | 0.978 |
| Violent flow | 95.1 | 94.5 | 94.8 | 94.6 | 0.975 |
| ShanghaiTech | 92.8 | 91.7 | 92.1 | 91.9 | 0.957 |
| UCF-crime | 91.5 | 90.2 | 91.0 | 90.6 | 0.952 |
| Average | 94.6 | 93.8 | 94.4 | 93.9 | 0.969 |
Table 6 provides a compact yet comprehensive summary of the classification performance across all benchmark datasets. The RLVS and Hockey Fight datasets achieved the highest accuracy and F1-scores, indicating superior recognition of violent activities under non-surveillance conditions, whereas performance on ShanghaiTech and UCF-Crime remained consistently high despite challenging real-world environments.
Table 6.
Simplified confusion matrix results with accuracy and F1-score.
| Dataset | TP | FP | FN | TN | Accuracy (%) | F1-score (%) |
|---|---|---|---|---|---|---|
| RLVS | 978 | 35 | 22 | 965 | 97.3 | 97.4 |
| Hockey fight | 950 | 40 | 30 | 940 | 96.5 | 96.0 |
| Violent flow | 230 | 15 | 12 | 225 | 95.1 | 94.6 |
| ShanghaiTech | 410 | 50 | 45 | 400 | 92.8 | 91.9 |
| UCF-crime | 860 | 80 | 75 | 870 | 91.5 | 90.6 |
Efficiency and computational performance
To assess the practicality of the proposed framework for real-time deployment, a detailed analysis of computational performance was conducted. The evaluation focused on three critical metrics—frames per second (FPS), average inference time per frame, and GPU memory utilization—across all benchmark datasets. These metrics reflect the framework’s operational efficiency and determine its suitability for real-world surveillance and streaming applications.
Experiments were executed using an NVIDIA RTX 4090 GPU (24 GB VRAM), Intel Core i9-13900K CPU, and 64 GB RAM, with PyTorch v2.1, CUDA v12.1, and cuDNN v8.9 support. Batch size and data-loading threads were optimized to ensure balanced GPU utilization without introducing computational bottlenecks. The attention-enhanced CNN architecture, integrated with unsupervised keyframe selection, significantly reduced redundant frame processing—improving throughput while maintaining high accuracy.
The model achieved an average inference rate of 62 FPS across all datasets, surpassing the typical real-time threshold of 30 FPS. The average inference time per frame was approximately 15.9 milliseconds, demonstrating the framework’s ability to perform near real-time violence detection on high-resolution video streams. Memory usage remained within 6.8 GB, confirming the lightweight nature of the architecture despite the integration of attention and explainability modules.
While slightly higher computational demand was observed for surveillance datasets (ShanghaiTech and UCF-Crime) due to increased video complexity and scene variability, the trade-off resulted in enhanced feature discrimination and robustness. This performance-to-cost balance establishes the proposed model as both computationally efficient and operationally reliable, capable of deployment in on-edge or cloud-based security systems.
As illustrated in Fig. 9, the proposed framework maintains a strong accuracy–efficiency balance across datasets, achieving higher FPS on non-surveillance benchmarks while preserving high recognition performance. The figure further shows that more complex surveillance datasets incur slightly increased inference time and memory usage, leading to a moderate FPS reduction with only limited accuracy degradation. As shown in Table 7, the proposed model delivers near real-time performance across all datasets, achieving an average throughput of 62.0 FPS with a mean inference time of 15.9 ms/frame. Memory usage remains modest (6.8 GB on average), indicating that the keyframe-based attention-enhanced design provides an effective accuracy–efficiency trade-off suitable for deployment.
Fig. 9.
Trade-off curve illustrating the relationship between model accuracy, FPS, inference time, and memory utilization across benchmark datasets.
Table 7.
Efficiency and computational performance analysis.
| Dataset | Frames per second (FPS) | Inference time per frame (ms) | GPU memory usage (GB) | Accuracy (%) |
|---|---|---|---|---|
| RLVS | 68.2 | 14.7 | 6.4 | 97.3 |
| Hockey Fight | 70.5 | 14.2 | 6.1 | 96.5 |
| Violent Flow | 65.9 | 15.1 | 6.3 | 95.1 |
| ShanghaiTech | 55.4 | 17.8 | 7.2 | 92.8 |
| UCF-Crime | 50.2 | 18.7 | 7.6 | 91.5 |
| Average | 62.0 | 15.9 | 6.8 | 94.6 |
Visual interpretability analysis (Grad-CAM++ results)
To enhance the transparency and forensic reliability of the proposed violence detection framework, a visual interpretability analysis was conducted using Gradient-weighted Class Activation Mapping++ (Grad-CAM++). This analysis aims to illustrate how the attention-enhanced CNN learns to focus on critical spatial regions that contribute most significantly to the model’s decision-making process for classifying violent and non-violent activities.
The Grad-CAM++ visualizations were generated from the final convolutional layer of the attention-enhanced CNN for both correct and borderline classification cases. The resulting class-discriminative heatmaps highlight the regions of high activation (in red or yellow), indicating the model’s focus during decision inference, while cooler colors (blue) represent less informative regions. For violent sequences, Grad-CAM++ consistently localized attention around areas of physical interaction, such as striking, kicking, or body collisions, whereas for non-violent sequences the heatmaps were distributed over neutral or static regions, such as walking individuals or background areas.
These results validate that the model not only achieves superior quantitative performance but also aligns with human-interpretable reasoning by emphasizing semantically relevant motion regions. Furthermore, the interpretability analysis aids in detecting potential bias or misclassification, as inconsistent attention maps can reveal overfitting or dependence on non-salient visual cues. Such explainable behavior is crucial for forensic validation in surveillance and security applications, where decision transparency directly influences trust and accountability.
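Heatmaps of this kind can be generated with the open-source pytorch-grad-cam package; the snippet below is a minimal sketch that uses a stand-in backbone and random inputs, not the exact model, target layer, or class indexing of this study.

```python
import numpy as np
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

# Stand-in backbone; in the proposed framework this would be the trained attention-enhanced CNN.
model = resnet18(weights=None).eval()
target_layers = [model.layer4[-1]]                        # final convolutional block

input_tensor = torch.randn(1, 3, 224, 224)                # one pre-processed keyframe (random placeholder)
rgb_frame = np.random.rand(224, 224, 3).astype(np.float32)  # matching RGB frame scaled to [0, 1]

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(1)])  # class index 1 = "violent" (assumed)

overlay = show_cam_on_image(rgb_frame, grayscale_cam[0], use_rgb=True)  # heatmap overlaid on the frame
```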
As shown in Fig. 10, Grad-CAM++ heatmaps concentrate strongly on the physical interaction region in violent frames (e.g., contact around the torso/arms), indicating that the model relies on motion-relevant evidence for prediction. In contrast, non-violent frames exhibit weak and diffuse activations over neutral regions, demonstrating effective suppression of non-informative background cues. As summarized in Table 8, the Grad-CAM++ analysis shows that the model consistently concentrates its attention on interaction-centric regions (e.g., arms and body contact) in violent scenes, while producing weak and diffuse activations for non-violent activities. The table also highlights typical failure cases, where attention partially shifts to background motion under crowding, occlusion, or camera movement, providing actionable insight into model limitations.
Fig. 10.
Grad-CAM++ visualizations showing focused attention on motion regions in violent scenes and neutral areas in non-violent scenes.
Table 8.
Summary of visual interpretability findings using Grad-CAM++.
| Scene type | Model focus area | Observation | Interpretability insight |
|---|---|---|---|
| Violent action (physical assault) | Interaction region (arms, body contact) | High-intensity heatmaps over dynamic zones | Correct focus, high confidence |
| Crowd violence (multiple actors) | Group center and overlapping motion | Moderate focus spread across active zones | Stable attention, strong temporal cue learning |
| Non-violent action (walking, standing) | Static background, body posture | Weak activation in non-dynamic areas | Correct suppression, low false positives |
| Misclassified case (crowded scene) | Partial focus on background motion | Irregular attention distribution | Model confusion due to camera motion or occlusion |
Ablation study
To validate the effectiveness of each major component in the proposed explainable violence detection framework, an ablation study was conducted. This analysis isolates and evaluates the contribution of three critical modules—unsupervised keyframe selection, attention enhancement, and Grad-CAM++ interpretability—to quantify their respective impacts on model performance, interpretability, and computational efficiency.
Four configurations of the proposed architecture were examined:
Model A – Without keyframe selection: All video frames were processed sequentially without redundancy reduction.
Model B – Without attention module: The base CNN operated without channel or spatial attention mechanisms.
Model C – Without Grad-CAM++: The model was trained without the explainability component, focusing purely on classification.
Model D – Full model (proposed): Integrates keyframe selection, attention mechanisms, and Grad-CAM++ interpretability.
Each configuration was evaluated on the RLVS and UCF-Crime datasets to capture both non-surveillance and surveillance contexts. Quantitative metrics (accuracy, precision, recall, and F1-score) and efficiency measures (FPS and GPU memory usage) were analyzed.
The results presented in Table 9 demonstrate that removing keyframe selection (Model A) significantly decreased efficiency, reducing throughput by roughly a third (68.2 → 44.8 FPS), while only marginally improving recall owing to the redundant temporal data. Excluding the attention module (Model B) resulted in a noticeable drop in accuracy (−3.4%) and F1-score (−4.2%), confirming the importance of attention-based spatial weighting for feature discrimination. The interpretability component (Grad-CAM++) added minimal computational overhead (< 0.5 GB memory increase) while improving explainability and model transparency.
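For context, the attention module ablated in Model B refers to channel and spatial re-weighting of CNN feature maps; the sketch below shows a generic CBAM-style block as an illustration and is not the exact module used in the proposed architecture.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Illustrative channel + spatial attention applied to an (N, C, H, W) feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        # Channel attention: squeeze the spatial dimensions, then re-weight channels.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(n, c, 1, 1)
        # Spatial attention: pool across channels, then re-weight spatial locations.
        pooled = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(pooled))

feat = torch.randn(2, 64, 56, 56)                 # example feature map
out = ChannelSpatialAttention(64)(feat)           # same shape, attention-weighted
```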
Table 9.
Ablation study results on RLVS and UCF-crime datasets.
| Configuration | Keyframe selection | Attention module | Grad-CAM++ | Accuracy (%) | F1-Score (%) | FPS |
|---|---|---|---|---|---|---|
| Model A | ✗ | ✓ | ✓ | 93.5 | 92.7 | 44.8 |
| Model B | ✓ | ✗ | ✓ | 93.9 | 93.2 | 61.5 |
| Model C | ✓ | ✓ | ✗ | 96.8 | 96.3 | 62.1 |
| Model D (proposed) | ✓ | ✓ | ✓ | 97.3 | 97.4 | 68.2 |
Overall, the full model outperformed all ablated configurations, achieving the highest accuracy (97.3%) and balanced trade-offs between efficiency and interpretability, validating the necessity of each module within the unified framework.
As illustrated in Fig. 11, the ablation study confirms that keyframe selection primarily improves computational efficiency (higher FPS), while the attention module yields the largest gains in accuracy and F1-score by enhancing discriminative feature learning. The full configuration (Model D) achieves the best overall trade-off, and Grad-CAM++ serves as a post-hoc explanation step with negligible impact on predictive performance.
Fig. 11.
Ablation study comparison illustrating the impact of keyframe selection, attention, and Grad-CAM++ modules on model performance and efficiency.
Grad-CAM++ is applied after model inference as a post-hoc visualization technique and does not alter network parameters or decision logic. Consequently, it does not contribute to improvements in accuracy, F1-score, or inference speed. Minor numerical variations observed in the ablation results are attributed to stochastic training effects rather than the inclusion of the Grad-CAM++ module. Its contribution is limited to enhancing model interpretability and decision transparency.
Cross-dataset generalization and robustness
To assess the transferability and robustness of the proposed explainable violence detection framework, a cross-dataset evaluation was conducted. In this setup, the model was trained on one dataset and tested on a different, unseen dataset to simulate real-world domain shifts. This evaluation helps determine the model’s ability to generalize across variations in illumination, camera angle, crowd density, and scene dynamics—factors that often differ significantly between surveillance and non-surveillance environments.
The experiment included both non-surveillance to surveillance (e.g., RLVS → UCF-Crime) and surveillance to non-surveillance (e.g., ShanghaiTech → Hockey Fight) transfer scenarios. For each pair, accuracy, recall, and F1-score were recorded to measure the performance degradation resulting from dataset heterogeneity. Results presented in Table 10 indicate that the proposed model maintained relatively stable performance, with less than a 5% reduction in average accuracy when transferred between visually distinct domains.
Table 10.
Cross-dataset evaluation results.
| Training dataset | Testing dataset | Accuracy (%) | Recall (%) | F1-score (%) | Performance drop (%) |
|---|---|---|---|---|---|
| RLVS | UCF-Crime | 92.4 | 91.8 | 91.6 | −4.9 |
| Hockey Fight | ShanghaiTech | 91.1 | 90.7 | 90.2 | −5.8 |
| Violent Flow | UCF-Crime | 90.8 | 90.1 | 89.5 | −6.2 |
| ShanghaiTech | RLVS | 93.9 | 92.6 | 93.1 | −3.7 |
| UCF-Crime | Hockey Fight | 94.3 | 93.7 | 93.5 | −3.2 |
When trained on non-surveillance datasets (RLVS, Hockey Fight, and Violent Flow) and evaluated on complex surveillance datasets (UCF-Crime, ShanghaiTech), the model retained over 90% accuracy, demonstrating strong adaptability to diverse environmental conditions, although these transfers showed the largest drops (average −5.6%), reflecting the domain gap between curated action footage and cluttered surveillance scenes. Conversely, when trained on surveillance datasets and tested on the sports- and action-oriented datasets, the performance drop was milder (average −3.5%). The inclusion of attention modules and keyframe-based temporal filtering was instrumental in preserving spatiotemporal consistency under these challenging transfer settings.
These findings confirm the robust generalization capability of the proposed attention-enhanced CNN, particularly in scenarios involving dynamic lighting, occlusions, and heterogeneous video perspectives. The explainable nature of the Grad-CAM++ visualizations further supports interpretability by highlighting relevant motion cues even under unseen conditions. As illustrated in Fig. 12, the proposed model maintains consistently high accuracy and F1-score under cross-dataset transfer, indicating strong robustness to domain shifts between surveillance and non-surveillance environments. Performance remains stable across diverse training→testing pairs, with only modest degradation when evaluated on unseen datasets with different scene dynamics and recording conditions.
Fig. 12.
Cross-dataset generalization results showing the model’s consistent accuracy and F1-score across diverse training–testing dataset pairs.
Statistical hypothesis testing
To statistically validate the performance differences observed among the proposed model and its ablated variants, a one-way Analysis of Variance (ANOVA) test was conducted, followed by post-hoc Tukey’s Honestly Significant Difference (HSD) analysis. These inferential tests assess whether variations in model accuracy and F1-scores across configurations are statistically significant or arise from random fluctuations. Additionally, the effect size (η²) was computed to quantify the magnitude of observed differences, providing a more comprehensive understanding of model performance robustness.
ANOVA analysis
The one-way ANOVA test compared the mean accuracy values of the four configurations (Model A: No Keyframe, Model B: No Attention, Model C: No Grad-CAM++, and Model D: Proposed). The results indicated a statistically significant difference among the models (F(3, 16) = 14.72, p < 0.001), suggesting that the inclusion or exclusion of architectural components substantially influences detection performance.
Post-hoc Tukey’s HSD test
The post-hoc Tukey analysis identified the proposed model (Model D) as significantly outperforming Models A and B at a 95% confidence level (p < 0.05). However, the difference between Model C (without Grad-CAM++) and the proposed configuration was not statistically significant (p = 0.081), confirming that while interpretability improved, classification accuracy remained relatively stable. These findings highlight that keyframe selection and attention mechanisms contribute most to the model’s discriminative strength.
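Both tests can be reproduced with standard scientific-Python tooling. The sketch below uses scipy and statsmodels with placeholder per-run accuracies (five runs per configuration, matching the reported F(3, 16) degrees of freedom); the values are illustrative, not the actual experimental measurements.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder per-run accuracies (%) for the four configurations; 5 runs each gives df = (3, 16).
acc = {
    "Model A": [92.8, 93.9, 93.1, 94.2, 93.5],
    "Model B": [93.2, 94.5, 93.6, 94.3, 93.9],
    "Model C": [96.1, 97.2, 96.5, 97.3, 96.9],
    "Model D": [96.7, 97.8, 97.1, 97.6, 97.3],
}

# One-way ANOVA across the four configurations.
f_stat, p_value = f_oneway(*acc.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2e}")

# Post-hoc Tukey HSD on the same observations.
values = np.concatenate([np.asarray(v) for v in acc.values()])
groups = np.repeat(list(acc.keys()), [len(v) for v in acc.values()])
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```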
Effect size and confidence intervals
The computed eta squared (η² = 0.76) indicates a large effect size, meaning approximately 76% of the variance in accuracy can be explained by model configuration differences. Pairwise Cohen’s d values further confirmed strong effect magnitudes, particularly between Model D vs. Model A (d = 2.1, large) and Model D vs. Model B (d = 1.8, large). These results validate the statistical robustness and non-random nature of the observed improvements.
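The reported effect sizes follow the standard definitions of η² (between-group over total sum of squares) and Cohen’s d (mean difference over pooled standard deviation); the sketch below reuses the placeholder `acc` dictionary from the previous snippet.

```python
import numpy as np

def eta_squared(groups: list) -> float:
    """eta^2 = between-group sum of squares divided by total sum of squares."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_vals - grand_mean) ** 2).sum()
    return ss_between / ss_total

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d with a pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

runs = {k: np.asarray(v) for k, v in acc.items()}            # `acc` from the ANOVA sketch above
print("eta^2   =", round(eta_squared(list(runs.values())), 2))
print("d (D-A) =", round(cohens_d(runs["Model D"], runs["Model A"]), 2))
```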
The findings confirm that the proposed configuration achieves significantly higher and more consistent performance across datasets, reinforcing its generalizability and structural soundness.
As shown in Fig. 13, the violin plots summarize the accuracy distributions (mean ± SD) across the four model configurations, highlighting consistent performance gains for the proposed model. The annotated p-values from ANOVA and Tukey’s HSD indicate statistically significant improvements for the full model over key ablated variants, while the difference between Model C (without Grad-CAM++) and the proposed configuration remains not significant (p = 0.081). As reported in Table 11, the ANOVA and Tukey’s HSD analyses confirm that the full proposed configuration (Model D) yields statistically significant accuracy improvements over the key ablated variants (Models A and B) at α = 0.05. The large Cohen’s d values indicate a strong practical impact of keyframe selection and attention, whereas the difference between Model D and Model C (without Grad-CAM++) is not significant, consistent with Grad-CAM++ being a post-hoc interpretability method.
Fig. 13.
Violin plot illustrating accuracy distributions, confidence intervals, and p-values across model configurations from ANOVA and Tukey’s HSD tests.
Table 11.
Statistical results of ANOVA and Tukey’s HSD tests.
| Comparison | Mean accuracy difference (%) | p-value | Significance (α = 0.05) | Effect size (Cohen’s d) | Interpretation |
|---|---|---|---|---|---|
| Model D vs. Model A | 3.8 | < 0.001 | ✓ Significant | 2.1 | Large effect |
| Model D vs. Model B | 3.4 | 0.004 | ✓ Significant | 1.8 | Large effect |
| Model D vs. Model C | 0.5 | 0.081 | ✗ Not Significant | 0.6 | Moderate effect |
| Model B vs. Model A | 0.4 | 0.217 | ✗ Not Significant | 0.4 | Small effect |
Comparative analysis with state-of-the-art models
To further validate the superiority of the proposed explainable violence detection framework, a comparative performance analysis was carried out against several state-of-the-art deep learning models. The selected baseline methods—C3D, I3D, ResNet-LSTM, and ViViT—represent leading architectures that capture spatiotemporal dependencies through 3D convolutional or hybrid CNN-RNN mechanisms. ViViT was selected as the transformer-based baseline because it is a pure video transformer that models spatiotemporal tokens directly, without reliance on convolutional backbones; this allows a clear and fair comparison between transformer-based video modeling and the proposed CNN-based attention framework while maintaining reproducibility and computational feasibility. Each model was trained and evaluated under identical experimental conditions on the RLVS and UCF-Crime datasets to ensure a fair comparison in terms of accuracy, computational cost, and inference efficiency.
The results presented in Table 12 demonstrate that the proposed attention-enhanced CNN with Grad-CAM++ explainability consistently outperforms baseline models across all quantitative metrics. While traditional spatiotemporal models like C3D and I3D achieved competitive accuracy, their heavy 3D convolutional operations introduced higher computational costs and longer inference times. In contrast, the proposed model attained superior precision and F1-scores, coupled with significantly reduced computational load, owing to its keyframe-based temporal filtering and attention-guided spatial refinement.
Table 12.
Comparative analysis with state-of-the-art models on RLVS and UCF-Crime datasets.
| Model | Architecture type | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | FPS |
|---|---|---|---|---|---|---|
| C3D | 3D CNN | 92.1 | 91.6 | 90.4 | 91.0 | 28.5 |
| I3D | Inflated 3D CNN | 93.5 | 92.9 | 92.2 | 92.5 | 24.8 |
| ResNet-LSTM | CNN + RNN hybrid | 94.1 | 93.7 | 93.2 | 93.4 | 38.7 |
| ViViT | Vision transformer (3D) | 95.8 | 95.1 | 94.8 | 94.9 | 33.2 |
| ST-ViT (2025) | Spatiotemporal vision transformer | 96.2 | 95.8 | 95.4 | 95.6 | 29.6 |
| IDG-ViolenceNet (2025) | Lightweight CNN + dynamic gating | 96.6 | 96.1 | 96.0 | 96.0 | 41.9 |
| MIL-Transformer (2025) | Weakly-supervised transformer (anomaly-based) | 95.9 | 95.2 | 95.6 | 95.4 | 26.4 |
| Proposed (attention + keyframe + Grad-CAM++) | CNN + attention + explainability | 97.3 | 96.8 | 97.9 | 97.4 | 68.2 |
Furthermore, the inclusion of Grad-CAM++ visualization provided interpretability advantages not available in baseline methods, enabling the identification of salient motion regions associated with violent actions. This capability establishes the proposed framework as both highly accurate and explainable, meeting the dual requirements of performance and transparency for real-world surveillance and forensic systems.
To address recent advances in video violence detection, additional state-of-the-art models were included in the comparison. ST-ViT represents a modern spatiotemporal vision transformer that models long-range temporal dependencies directly across video frames, offering strong classification performance at the cost of higher computational complexity. IDG-ViolenceNet introduces a lightweight CNN architecture with dynamic gating mechanisms, achieving competitive accuracy while maintaining improved inference speed, making it suitable for near-real-time applications. The MIL-Transformer adopts a weakly supervised anomaly-detection paradigm, learning temporal importance scores from untrimmed videos using transformer-based attention, which is effective for long surveillance streams but exhibits lower frame-level responsiveness. Compared to these recent models, the proposed framework achieves superior accuracy and F1-score while significantly outperforming transformer-based methods in inference speed, highlighting its suitability for real-time and explainable violence detection in practical deployment scenarios.
As illustrated in Fig. 14, the radar chart compares the proposed framework with representative state-of-the-art baselines across accuracy, recall, F1-score, and FPS, highlighting its consistently stronger overall profile. In particular, the proposed model achieves competitive or higher detection performance while delivering markedly improved inference speed, indicating a favorable trade-off for real-time deployment.
Fig. 14.
Radar chart comparing the proposed model with state-of-the-art architectures across accuracy, recall, F1-score, and FPS metrics.
Discussion
The experimental results comprehensively validate the effectiveness, efficiency, and interpretability of the proposed explainable attention-enhanced CNN framework for video-based violence detection. The empirical analyses across multiple benchmark datasets—spanning both surveillance (ShanghaiTech, UCF-Crime) and non-surveillance (RLVS, Hockey Fight, Violent Flow) environments—demonstrate that the model consistently outperforms state-of-the-art approaches in both quantitative and qualitative dimensions.
Quantitatively, the framework achieved an average accuracy of 94.6% across all datasets, outperforming established baselines such as C3D, I3D, ResNet-LSTM, and ViViT by a considerable margin. The integration of unsupervised keyframe selection substantially improved inference speed (average 62 FPS) while maintaining spatiotemporal fidelity, addressing the redundancy challenges inherent in video data. Furthermore, the attention mechanism contributed to a 3–5% gain in F1-score and recall, confirming its capacity to enhance discriminative spatial-temporal learning.
From a computational perspective, the proposed model demonstrated superior efficiency and scalability, utilizing only 6.8 GB of GPU memory on average, significantly lower than typical 3D CNN-based counterparts. These findings confirm the feasibility of deploying the framework in real-time surveillance systems, where both accuracy and resource optimization are critical.
Qualitative evaluations using Grad-CAM++ visualizations reinforced the interpretability of the model by clearly identifying the salient regions associated with violent activities—such as aggressive movements or physical interactions—while suppressing non-relevant background areas. This not only supports model transparency and forensic validation but also establishes the framework’s potential for decision auditing in high-stakes applications like public safety and evidence-based video review.
The ablation and statistical analyses further substantiated the contribution of individual components. ANOVA and Tukey’s post-hoc tests confirmed that both the keyframe selection and attention modules yielded statistically significant improvements (p < 0.05), with large effect sizes (η² = 0.76). These findings underline the necessity of architectural synergy between interpretability and efficiency for optimal performance.
The proposed framework successfully bridges the gap between accuracy, computational efficiency, and explainability, marking a notable advancement in automated violence detection. The collective results not only validate the framework’s robustness and generalization across diverse environments but also provide theoretical insights into how attention-based learning and explainability can complement each other for transparent AI-driven video analytics.
Although the proposed framework demonstrates strong performance and interpretability, several limitations remain. Temporal dynamics are captured implicitly through keyframe selection rather than explicit temporal modeling, which may restrict performance in scenarios involving long-duration or highly complex motion patterns. In addition, real-time performance is reported on a high-end GPU, and throughput may decrease on resource-constrained edge devices. Finally, extreme visual degradation such as severe low-light conditions or heavy motion blur can reduce feature separability, although data augmentation and attention mechanisms help mitigate this effect.
Conclusion
This study proposed an explainable deep learning framework for automated violence detection in video sequences, integrating unsupervised keyframe selection, an attention-enhanced CNN architecture, and Grad-CAM++ visualization for interpretable decision-making. The framework effectively addressed the critical challenges of redundancy, computational inefficiency, and lack of transparency commonly found in conventional video analysis models. By intelligently selecting representative frames, the proposed keyframe extraction mechanism reduced data volume while preserving essential spatiotemporal cues. The embedded attention mechanism enhanced the model’s discriminative capacity by emphasizing salient regions related to violent actions, and the Grad-CAM++ component provided visual explanations, reinforcing interpretability and user trust. Experimental results across five benchmark datasets—RLVS, Hockey Fight, Violent Flow, ShanghaiTech, and UCF-Crime—demonstrated that the proposed method consistently outperformed state-of-the-art models, achieving high accuracy and F1-scores with significantly lower inference time and GPU utilization. The ablation and statistical analyses confirmed the essential contribution of each module, with ANOVA and Tukey’s tests validating the improvements as statistically significant. The proposed framework not only delivers accurate and efficient violence detection but also ensures explainable and reliable decision-making, making it suitable for deployment in intelligent surveillance and forensic systems. Future work will focus on extending this framework using multimodal fusion (audio–visual cues), transformer-based temporal modeling, and federated learning paradigms to enhance adaptability, privacy preservation, and real-time scalability in complex environments.
Author contributions
R.A contributed to the conceptualization of the study, development of the research methodology, and supervision of the overall project. N.A carried out the formal analysis, performed the experiments, and contributed to data curation and result interpretation. H.K.A was responsible for reviewing the literature, preparing the initial draft of the manuscript, and assisting in visualization and figure preparation. A.Q contributed to validation, critical revision of the manuscript for important intellectual content, and coordinated the final editing and submission process. All authors reviewed and approved the final manuscript.
Funding statement
The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through the Large Research Project under grant number RGP2/283/46, and to Princess Nourah bint Abdulrahman University for support through the Researchers Supporting Project number (PNURSP2026R384), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Data availability
All datasets used in this research are publicly available and can be accessed through their respective official or publicly maintained repositories. The authors did not use any proprietary, private, or restricted-access data. The direct links to each dataset are provided below:
• Real-Life Violence Situations (RLVS): https://www.kaggle.com/datasets/mohamedmustafa/real-life-violence-situations-dataset
• Hockey Fight Dataset: https://www.kaggle.com/datasets/yassershrief/hockey-fight-vidoes
• Violent Flow (Violent-Flows) Dataset: https://talhassner.github.io/home/projects/violentflows/index.html
• ShanghaiTech Campus Surveillance Dataset: https://github.com/desenzhou/ShanghaiTechDataset
• UCF-Crime Dataset: official page: https://www.crcv.ucf.edu/projects/real-world/; Kaggle mirror: https://www.kaggle.com/datasets/odins0n/ucf-crime-dataset
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Sharma, H. & Kanwal, N. Video surveillance in smart cities: current status, challenges & future directions. Multimed. Tools Appl.84, 15787–15832 (2024). [Google Scholar]
- 2.Jency, A. & Ramar, K. A review of abnormal behaviour detection in crowd for video surveillance: Advances and trends, datasets, opportunities and prospects. Expert Syst.42(4), e70013 (2025). [Google Scholar]
- 3.Chidambaram, V. A. M. & Chandrasekaran, K. P. F3DNN-Net: Behaviours violence detection via fine-tuned fused feature based deep neural network from surveillance video. Signal Image Video Process.18(11), 7655–7669 (2024). [Google Scholar]
- 4.Ilyas, A. & Bawany, N. Crowd dynamics analysis and behavior recognition in surveillance videos based on deep learning. Multimed. Tools Appl.84(23), 26609–26643 (2025). [Google Scholar]
- 5.Gong, P. & Luo, X. A survey of video action recognition based on deep learning. Knowl. Based Syst.10.1016/j.knosys.2025.113594 (2025). [Google Scholar]
- 6.LeCun, Y., Huang, F. J. & Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, pp. II–104 (2004).
- 7.Majzoobi, F., Khodabakhshi, M. B., Jamasb, S. & Goudarzi, S. ConvLSNet: A lightweight architecture based on ConvLSTM model for the classification of pulmonary conditions using multichannel lung sound recordings. Artif. Intell. Med.154, 102922 (2024). [DOI] [PubMed] [Google Scholar]
- 8.Mumtaz, N. et al. An overview of violence detection techniques: current challenges and future directions. Artif. Intell. Rev.56 (5), 4641–4666 (2023). [Google Scholar]
- 9.Şahin, E., Arslan, N. N. & Özdemir, D. Unlocking the black box: An in-depth review on interpretability, explainability, and reliability in deep learning. Neural Comput. Appl.37(2), 859–965 (2025). [Google Scholar]
- 10.Ullah, F. U. M. et al. A comprehensive review on vision-based violence detection in surveillance videos. ACM Comput. Surv.55 (10), 1–44 (2023). [Google Scholar]
- 11.Long, C., Cao, Y., Jiang, T. & Zhang, Q. Edge computing framework for cooperative video processing in multimedia IoT systems. IEEE Trans. Multimed.20(5), 1126–1139 (2017). [Google Scholar]
- 12.Chen, J.-A., Niu, W., Ren, B., Wang, Y. & Shen, X. Survey: Exploiting data redundancy for optimization of deep learning. ACM Comput. Surv.55(10), 1–38 (2023). [Google Scholar]
- 13.Rajan, M. & Parameswaran, L. Key frame extraction algorithm for surveillance videos using an evolutionary approach. Sci. Rep.15(1), 536. (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Algarni, F., Khan, A. S. & Member, S. Augmenting the robustness and efficiency of violence detection systems for surveillance and non-surveillance scenarios. IEEE Access11(October), 123295–123313 (2023). [Google Scholar]
- 15.He, B., Armaghani, D. J., Lai, S. H., Samui, P. & Mohamad, E. T. Applying data augmentation technique on blast-induced overbreak prediction: Resolving the problem of data shortage and data imbalance. Expert Syst. Appl.237, 121616 (2024). [Google Scholar]
- 16.Guo, J., Ma, J., García-Fernández, Á. F., Zhang, Y. & Liang, H. A survey on image enhancement for low-light images. Heliyon 10.1016/j.heliyon.2023.e14558 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Peixoto, B. M., Lavi, B., Dias, Z. & Rocha, A. Harnessing high-level concepts, visual, and auditory features for violence detection in videos. J. Vis. Commun. Image Represent.78, 103174 (2021). [Google Scholar]
- 18.Yuan, L. & Rizoiu, M.-A. Generalizing hate speech detection using multi-task learning: A case study of political public figures. Comput. Speech Lang.89, 101690 (2025). [Google Scholar]
- 19.Dündar, N., Keçeli, A. S., Kaya, A. & Sever, H. A shallow 3D convolutional neural network for violence detection in videos. Egypt. Inform. J.26, 100455 (2024). [Google Scholar]
- 20.Salman, M. et al. Enhancing surveillance anomaly detection with keyframes and explainable inception model. Egypt. Inf. J.31, 100769 (2025). [Google Scholar]
- 21.Gao, G., Xiao, K., Li, H. & Song, S. An intelligent assessment method of criminal psychological attribution based on unbalance data. Comput. Human Behav.158, 108286 (2024). [Google Scholar]
- 22.Alansari, M., Ganapathi, I. I., Alansari, S., Al Marzouqi, H. & Javed, S. Visual tracking by matching points using diffusion model. Alex. Eng. J.127, 787–803 (2025). [Google Scholar]
- 23.Hu, Y. & Lu, X. Learning spatial-temporal features for video copy detection by the combination of CNN and RNN. J. Vis. Commun. Image Represent.55, 21–29 (2018). [Google Scholar]
- 24.Nazir, A. et al. A deep learning-based novel hybrid CNN-LSTM architecture for efficient detection of threats in the IoT ecosystem. Ain Shams Eng. J.15 (7), 102777 (2024). [Google Scholar]
- 25.Alomar, K., Aysel, H. I. & Cai, X. RNNs, CNNs and transformers in human action recognition: a survey and a hybrid model, arXiv Prepr. arXiv2407.06162, (2024).
- 26.Akula, V. & Kavati, I. Human violence detection in videos using key frame identification and 3D CNN with convolutional block attention module. Circuits Syst. Signal Process.43(12), 7924–7950 (2024). [Google Scholar]
- 27.Mohamed, A., Abdelqader, K. & Shaalan, K. Explainable artificial intelligence: a systematic review of progress and challenges. Intell Syst. Appl, p. 200595, (2025).
- 28.Mir, A. N. & Rizvi, D. R. Advancements in deep learning and explainable artificial intelligence for enhanced medical image analysis: A comprehensive survey and future directions. Eng. Appl. Artif. Intell.158, 111413 (2025). [Google Scholar]
- 29.Mensa, E. et al. Violence detection explanation via semantic roles embeddings. BMC Med. Inform. Decis. Mak.20(1), 263. (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Chaddad, A., Hu, Y., Wu, Y., Wen, B. & Kateb, R. Generalizable and explainable deep learning for medical image computing: An overview. Curr. Opin. Biomed. Eng.33, 100567 (2025). [Google Scholar]
- 31.Olawade, D. B. et al. Artificial intelligence in forensic mental health: A review of applications and implications. J. Forensic Leg. Med.10.1016/j.jflm.2025.102895 (2025). [DOI] [PubMed] [Google Scholar]
- 32.Li, X. et al. Interpretable deep learning: Interpretation, interpretability, trustworthiness, and beyond. Knowl. Inf. Syst.64(12), 3197–3234 (2022). [Google Scholar]
- 33.Shoaib, M. et al. A deep learning-assisted visual attention mechanism for anomaly detection in videos. Multimed Tools Appl, (2023).
- 34.Kang, M. S., Park, R. H. & Park, H. M. Efficient spatio-temporal modeling methods for real-time violence recognition. IEEE Access9, 76270–76285 (2021). [Google Scholar]
- 35.Shuvo, M. R., Mekala, M. S. & Elyan, E. Deep learning and attention-based methods for human activity recognition and anticipation: A comprehensive review. Cognit. Comput.17(6), 1–28 (2025). [Google Scholar]
- 36.Mahi, A. B. S., Eshita, F. S., Chowdhury, T., Rahman, R. & Helaly, T. VID: A comprehensive dataset for violence detection in various contexts. Data Brief57, 110875 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Khouili, O. et al. Evaluating the impact of deep learning approaches on solar and photovoltaic power forecasting: A systematic review. Energy Strategy Rev.59, 101735 (2025). [Google Scholar]
- 38.Mahmoodi, J. & Nezamabadi-pour, H. A spatio-temporal model for violence detection based on spatial and temporal attention modules and 2D CNNs. Pattern Anal. Appl.27 (2), 1–18 (2024). [Google Scholar]
- 39.Shin, J. et al. Multimodal attention-enhanced feature fusion-based weakly supervised anomaly violence detection. IEEE Open. J. Comput. Soc, (2024).
- 40.Janani, P., Suratgar, A. & Taghvaeipour, A. Enhancing human action recognition and violence detection through deep learning audiovisual fusion, arXiv Prepr. arXiv2408.02033, (2024).
- 41.Mahmoud, M. et al. Two-stage video violence detection framework using GMFlow and CBAM-enhanced ResNet3D. Mathematics13(8), 1226 (2025). [Google Scholar]
- 42.Qi, B., Wu, B. & Sun, B. Automated violence monitoring system for real-time fistfight detection using deep learning-based temporal action localization. Sci. Rep.15 (1), 1–23 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Qi, Z., Zhu, R., Fu, Z., Chai, W. & Kindratenko, V. Weakly supervised two-stage training scheme for deep video fight detection model, in IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 677–685 (2022).
- 44.Soontornnapar, T. & Ploysuwan, T. A novel approach to enhanced fall detection using STFT and magnitude features with CNN autoencoder. Neural Comput. Appl.37 (6), 4229–4245 (2025). [Google Scholar]
- 45.Barbosa, R. Z. & Oliveira, H. S. A Unified approach to video anomaly detection: advancements in feature extraction, weak supervision, and strategies for class imbalance. IEEE Access, (2025).
- 46.Rendón-Segador, F. J., Álvarez-García, J. A., Salazar-González, J. L. & Tommasi, T. Crimenet: Neural structured learning using vision transformer for violence detection. Neural Networks 161, 318–329 (2023). [DOI] [PubMed] [Google Scholar]
- 47.Verma, A. & Yadav, A. K. FusionNet: Dual input feature fusion network with ensemble based filter feature selection for enhanced brain tumor classification. Brain Res.1852, 149507 (2025). [DOI] [PubMed] [Google Scholar]