Abstract
Weakly supervised video anomaly detection aims to detect anomalous events with only video-level labels. In the absence of boundary information for anomaly segments, most existing methods rely on multiple instance learning. In these approaches, the predictions for unlabeled video snippets are guided by the classification of labeled untrimmed videos. However, these methods do not account for issues such as video blur and visual occlusion, which can hinder accurate anomaly detection. To address these issues, we propose a novel weakly supervised video anomaly detection method that fuses multimodal and multiscale features. Firstly, RGB and optical flow snippets are input into pre-trained I3D to extract appearance and motion features. Then, we introduce an Attention De-redundancy (AD) module, which employs an attention mechanism to filter out task-irrelevant redundancy in these appearance and motion features. Next, to mitigate the effects of video blurring and visual occlusion, we propose a Multi-scale Feature Learning module. This module captures long-term and short-term temporal dependencies among video snippets to provide global and local guidance for blurred or occluded video snippets. Finally, to effectively utilize the discriminative features of different modalities, we propose an Adaptive Feature Fusion module. This module adaptively fuses appearance and motion features based on their respective feature weights. Extensive experimental results demonstrate that our proposed method outperforms mainstream unsupervised and weakly supervised methods in terms of AUC. Specifically, our proposed method achieves 97.00% AUC and 85.31% AUC on two benchmark datasets, i.e., ShanghaiTech and UCF-Crime, respectively.
Keywords: Video anomaly detection, Weak supervision, Multiscale features, Multimodal fusion
Subject terms: Computer science, Information technology
Introduction
With the increasing focus on security, surveillance cameras are becoming more prevalent in public areas such as shopping malls, banks, and intersections. The primary purpose of surveillance video is to detect abnormal events1, including road accidents and criminal activities. Traditional methods rely on personnel monitoring these videos to detect anomalies. While effective to an extent, they often require substantial human resources; moreover, staff fatigue or complacency can lead to missed or incorrect detections. Therefore, developing intelligent algorithms capable of automatically detecting anomalous events in surveillance videos is crucial for reducing manual labor, alleviating economic losses, and enhancing social security.
Surveillance videos contain various types of anomalous events, such as motion anomalies and appearance anomalies. These anomalies can differ significantly in visual features and temporal patterns, making it challenging for detection models to generalize across diverse anomaly types. In response to these issues, prior research2–4 primarily focused on extracting the most discriminative features from RGB snippets to effectively detect anomalies. However, these approaches rely solely on unimodal RGB data. RGB data offers detailed appearance and texture information, enabling effective detection of appearance anomalies, but its ability to detect motion anomalies can be relatively limited. For instance, when a high-speed vehicle abruptly halts, it is highly probable that the vehicle has suffered a breakdown or collision; such an event may not exhibit any visible anomalies in appearance (i.e., RGB), yet the corresponding changes in motion (i.e., optical flow) are distinctly evident. Optical flow and RGB represent motion and appearance features, respectively, and therefore provide complementary information for video anomaly detection. Some researchers have proposed multimodal video anomaly detection approaches that fuse information from the RGB and optical flow modalities. These two modalities complement each other and effectively mitigate the limitations caused by insufficient information within a single modality, enabling the detection of diverse types of anomalies in videos.
Current multimodal video anomaly detection approaches are primarily categorized into two classes: unsupervised methods and weakly supervised methods. Unsupervised approaches5–10 train the model solely on normal training samples and detect events deviating from the learned normal patterns as anomalies. These methods alleviate the need for manually labeled data, resulting in a significant reduction in labeling costs. Nonetheless, they are vulnerable to mistakenly flagging new, unlearned normal events, leading to decreased anomaly detection accuracy. Moreover, by concentrating exclusively on learning the feature representation of normal events, these methods overlook effective optimization for abnormal event detection.
Weakly supervised methods train the network using both video-level labeled abnormal and normal videos. These approaches can acquire more discriminative representations of abnormal and normal events, thus improving detection performance. Weakly supervised methods consist of two main stages: feature extraction and feature fusion. In the feature extraction stage, certain approaches11,12 employed pre-trained Inflated 3D ConvNet (I3D) or 3D convolution (C3D) networks to extract motion and appearance representations from optical flow and RGB clips. However, these feature extraction methods suffer from the following limitations: (1) The extracted features involve redundant information. Previous studies detect abnormal events by directly leveraging representations acquired from pre-trained C3D13 or I3D14. However, C3D and I3D are trained for the task of video action classification rather than specifically for video anomaly detection, leading to inevitable redundancy in the extracted features. (2) These methods do not account for issues such as video blur and visual occlusion. The employed datasets include video scenes captured under low-light conditions, resulting in video blur; examples of video blur are shown in Fig. 1. Additionally, the datasets contain scenes with visual occlusion; examples are shown in Fig. 2. For video snippets that are blurred or occluded, the inability to extract their complete features hinders the detection of abnormal snippets, leading to inferior detection performance. In the feature fusion stage, Wan et al.11 and Dubey et al.12 integrated motion and appearance features with equal weight and subsequently fed these features into the anomaly detection classifiers. Wei et al.15 designed a multimodal fusion framework comprising two unimodal streams, namely RGB and optical flow, along with their fused stream. The feature representation is enhanced by aligning information from these three streams, and the improved features from the different streams are then concatenated. However, these feature fusion methods simply integrate appearance and motion features with equal weight, treating them equally, whereas the significance of appearance and motion features may vary across different types of abnormalities. To effectively leverage features from different modalities, it is crucial to allocate appropriate weights to the representations of each modality.
Fig. 1.
The examples of video blur. The parts highlighted by the red lines indicate blurred anomaly regions.
Fig. 2.
The examples of visual occlusion. The parts highlighted by the red lines indicate occluded anomaly regions.
To cope with the above problems in weakly supervised multimodal video anomaly detection, namely redundancy in the extracted appearance and motion representations, the neglect of video blur and visual occlusion, and the equal treatment of appearance and motion features, we propose a novel weakly supervised video anomaly detection method that fuses multimodal and multiscale features. In the feature extraction stage, to enhance the feature representation, we introduce an Attention De-redundancy (AD) module. This module captures both local and global features specific to each modality and fuses them with an attention mechanism to exclude task-irrelevant redundancy from both the motion and appearance features. Intuitively, the AD module uses global information to judge whether a local response is task-irrelevant redundancy. Furthermore, we devise a Multiscale Feature Learning (MFL) module. This module leverages a dilated convolutional network with different dilation rates to capture short-term dependencies among video clips and a self-attention mechanism to capture long-term dependencies, effectively addressing the challenges associated with video blur and visual occlusion. In the feature fusion stage, to effectively leverage the representations from different modalities, we propose an Adaptive Feature Fusion (AFF) module. This module leverages global information to compute feature weights for the different modalities, assigning higher weights to the features of the more important modality and lower weights to the features of the less important modality, thereby enabling adaptive fusion of the diverse modalities.
The primary contributions of our work are as follows:
We propose a novel weakly supervised video anomaly detection method that fuses multimodal and multiscale features. By combining the optical flow and RGB modalities, our method effectively addresses the limitations arising from insufficient information within a single modality, enabling the detection of various types of video anomalies.
We devise an MFL module to effectively handle issues related to video blur and visual occlusion. This module employs dilated convolutional networks and a self-attention mechanism to capture short-term and long-term dependencies among video snippets, thereby providing global and local information guidance for blurred or occluded video snippets.
We design an AFF module to effectively utilize features of different modalities. This module leverages global information to obtain the importance weights for appearance and motion features. By assigning higher weights to the features of significant modality and lower weights to the features of less significant modality, the AFF module adaptively fuses the motion and appearance features.
We conduct comprehensive experiments on the UCF-Crime and ShanghaiTech datasets. The experimental results demonstrate that our method surpasses mainstream video anomaly detection algorithms.
Literature review
RGB-based Video anomaly detection methods
Hasan et al.2 extracted regular features of normal patterns in long-term videos by employing a Convolutional AutoEncoder (CAE) and detected abnormal events based on reconstruction errors: lower reconstruction errors characterize normal events, while higher reconstruction errors characterize abnormal events. However, in certain cases, the AutoEncoder (AE) can effectively reconstruct abnormal events, leading to missed detections. Gong et al.3 designed the Memory-augmented deep AutoEncoder (MemAE) to address this problem. The MemAE reconstructs each input from the most similar items retrieved from a Memory module, thereby enlarging the reconstruction error associated with anomalies. Liu et al.4 detected video anomalies by predicting future frames. The prediction network adopts the U-Net architecture and detects abnormal events by comparing the disparity between the predicted future frame and the actual future frame. Sultani et al.16 applied C3D to extract features from normal and abnormal snippets and subsequently calculated anomaly scores for each snippet. By imposing the constraint that the maximum anomaly score among the abnormal snippets exceeds that among the normal snippets, an iterative training process yields a multi-instance anomaly ranking model for effective anomaly detection. Tian et al.17 detected video anomalies through feature magnitude learning. They utilized the Inflated 3D ConvNet (I3D) network to extract snippet representations and trained a classifier with the top-k snippets ranked by feature magnitude, distinguishing abnormal videos from normal videos more effectively. Since video-level annotated anomalous videos may contain normal segments, Zhong et al.18 first cleaned the label noise and then employed supervised action classifiers to detect abnormal events. By maximizing the utilization of well-established classifiers, their method achieves effective abnormal behavior detection. Zaheer et al.19 proposed a normalcy suppression mechanism to minimize the abnormal scores of normal snippets and designed a loss function based on clustering distance to mitigate label noise. To address real-time anomaly detection, Zhang et al.20 encoded adjacent snippets through a Temporal Convolutional Network (TCN). Peng et al.21 devised an anomaly detection method consisting of four modules: causal temporal relation, classifier, compactness, and dispersion. They utilized I3D to extract snippet representations and then designed the compactness and dispersion modules, where the compactness module ensures intra-class compactness of normal features and the dispersion module amplifies the discrepancy between normal and abnormal videos. Additionally, the causal temporal relation module captures local temporal context, and the classifier module employs causal convolutions to obtain segment-level anomaly scores. Feng et al.22 designed the Multiple Instance Self-Training (MIST) framework, which utilizes a generator to produce segment-level pseudo-labels for videos, thereby enabling the supervised training of a feature encoder. However, all of these methods employ unimodal RGB data for anomaly detection. Unimodal data may lack comprehensive information, making it challenging to accurately detect anomalies.
Video anomaly detection methods based on multi-modality
Data from different modalities can complement and enhance each other, and numerous computer vision models23–31 have employed multimodal data to achieve superior performance. Ngiam et al.32 employed deep AEs to learn general representations of multimodal data, which yields promising results in speech and vision tasks. Visual and audio events often co-occur, yet video and audio provide distinct information. Afouras et al.33 fused visual and audio data to address specific tasks, and Hong et al.24 employed audio data to assist visual data in localizing video highlights. Some studies on video anomaly detection have fused information from two modalities: RGB and optical flow. Optical flow represents motion information, while RGB represents appearance information. These modalities complement each other, effectively mitigating the limitations presented by insufficient information within a single modality and advancing anomaly detection performance. Ionescu et al.34 designed a feature learning framework that utilizes an object-centric convolutional AE to capture both motion and appearance information. Moreover, they proposed a classification method over training-sample clusters, using an abnormal event classifier to distinguish each normal cluster from the others. Nguyen et al.35 combined a convolutional AE with the U-Net architecture. They utilized the convolutional AE to reconstruct the appearance information of frames and the U-Net structure to predict the motion information of frames; the anomaly scores of these frames were computed with a shared encoder. Wan et al.11 designed the Anomaly Regression Net (AR-Net) framework to detect anomalies. They leveraged a pre-trained I3D network to extract motion and appearance representations from optical flow and RGB snippets, which were subsequently concatenated and input into the anomaly detection classifier. Dubey et al.12 designed the Deep-network with Multiple Ranking Measures (DMRMs) model to detect abnormal events. They used optical flow and RGB snippets as input and employed a pre-trained C3D network to extract motion and appearance representations, which were then integrated and input into the anomaly detection classifier. To widen the gap in anomaly scores between anomalous and normal snippets, a multi-ranking-measure loss was utilized during training. Wei et al.15 designed an anomaly detection framework called Multimodal Supervise-Attention Enhanced Fusion (MSAF). This framework comprises two unimodal streams, namely RGB and optical flow, along with their integrated stream. The features are enhanced by aligning information from these three streams, and the improved features from the different streams are then concatenated. However, the above methods still face several challenges, including redundancy in the extracted motion and appearance features, the neglect of video blur and visual occlusion, and the equal treatment of motion and appearance features.
Temporal action detection
Temporal action detection focuses on identifying the start and end timestamps as well as the action categories. Anomaly detection can be regarded as a coarse-grained form of temporal action detection. Our focus is on weakly supervised action localization, which closely relates to weakly supervised anomaly detection: both use only video-level labels for training and aim to detect the start and end timestamps of the events of interest. In recent years, several innovative approaches have been proposed, including MHCS36, TSM37, Densernet38, and NGPR39. However, these methods do not adequately address challenges such as video blur and visual occlusion.
Method
Figure 3a depicts the architecture of our method. A given pair of normal and abnormal videos is divided into N RGB segments and N optical flow segments. Using a pre-trained I3D network, motion features and appearance features are extracted from the optical flow snippets and RGB snippets, respectively. The AD module then filters out the task-irrelevant information in both the appearance and motion features. Next, the MFL module captures the long-term and short-term dependencies among segments to address the challenges of video blurring and visual occlusion. In the feature fusion stage, the AFF module adaptively fuses the motion and appearance features, and the fused features are fed into fully connected layers to calculate the anomaly score for each snippet in both the abnormal and normal videos. To enlarge the gap between the scores of anomalous snippets and those of normal snippets, we train the network with the Multiple Instance Learning (MIL) ranking loss.
Fig. 3.
The architecture of our method. a A given pair of normal and abnormal videos is divided into N RGB segments and optical flow segments. In the feature extraction stage, RGB and optical flow segments are fed into pre-trained I3D to extract appearance and motion features. Subsequently, the AD module filters out task-irrelevant redundancy in these appearance and motion features. Furthermore, the MFL module alleviates challenges such as video blurring and visual occlusion. In the feature fusion stage, the AFF module adaptively fuses appearance and motion features. These fused features are then input into fully connected layers to compute the anomaly scores for each segment in both the normal and anomalous videos. These parts highlighted by the orange and blue dashed lines are our key contributions. b Multi-scale feature learning module leverages a dilated convolutional network and a self-attention mechanism to capture long-term and short-term dependencies among clips. By providing global and local information guidance for snippets with blurred or occluded regions, this module effectively mitigates the effects of video blur and visual occlusion. c Adaptive feature fusion module utilizes global information to calculate the importance weights of the features from different modalities. It assigns higher weights to the features of the more important modality and lower weights to the features of the less significant modality, enabling adaptive fusion of diverse modalities.
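For readers who want to relate the pipeline in Fig. 3a to an implementation, the following PyTorch skeleton sketches the overall data flow under stated assumptions: the AD, MFL and AFF blocks appear only as placeholders (concrete sketches are given in the corresponding sections below), the 2048-D fused feature size and the 512-128-1 scoring layers follow the implementation details, and the final sigmoid is an assumption used to bound the snippet scores in [0, 1].

```python
import torch
import torch.nn as nn

class AnomalyScoringPipeline(nn.Module):
    """Skeleton of the data flow in Fig. 3a: I3D features -> AD -> MFL -> AFF -> FC scores."""
    def __init__(self, fused_dim=2048):
        super().__init__()
        # Placeholders for the per-modality AD and MFL modules (sketched in later sections).
        self.ad_app, self.ad_mot = nn.Identity(), nn.Identity()
        self.mfl_app, self.mfl_mot = nn.Identity(), nn.Identity()
        # Three fully connected layers with 512, 128 and 1 nodes, as in the implementation details.
        self.scorer = nn.Sequential(
            nn.Linear(fused_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),     # sigmoid assumed, to keep scores in [0, 1]
        )

    def forward(self, f_app, f_mot):
        # f_app, f_mot: (B, T, fused_dim) appearance / motion features (identity placeholders
        # keep the dimension fixed; the real AD/MFL modules map 1024-D I3D features to 2048-D).
        f_app = self.mfl_app(self.ad_app(f_app))
        f_mot = self.mfl_mot(self.ad_mot(f_mot))
        fused = 0.5 * (f_app + f_mot)            # stand-in for the AFF module sketched later
        return self.scorer(fused).squeeze(-1)    # (B, T) per-snippet anomaly scores
```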
Attention de-redundancy module
The optical flow segments and RGB segments are fed into a pre-trained I3D network to acquire motion and appearance representations. Nevertheless, it is worth noting that the I3D network is pre-trained on the Kinetics dataset and primarily designed for classifying video actions rather than for detecting video anomalies. Consequently, the extracted motion and appearance representations inevitably contain redundancy. To address this challenge, the Attention De-redundancy (AD) module is introduced, as shown in Fig. 4. We input the representations of the appearance and motion modalities into the AD module to capture both local and global features specific to each modality. Subsequently, the local and global features are fused to identify task-independent redundancy. Finally, we employ an attention mechanism to filter out this redundancy from the motion and appearance features. For clarity, we use the appearance modality as the illustration; the identical procedure applies to the motion modality.
Fig. 4.

Attention de-redundancy module. The module contains Global and Local Context units to identify information redundancy and employs an attention mechanism to filter out this redundancy from the motion and appearance features.
Firstly, to acquire global features of the appearance modality, we feed the appearance features $F_a \in \mathbb{R}^{T \times D}$ into an average pooling layer. A convolutional layer is then employed to produce the global perceptual descriptor $P_g \in \mathbb{R}^{D}$, where $D$ denotes the feature dimension:

$$G_a = \mathrm{AvgPool}(F_a) \tag{1}$$

$$P_g = \mathrm{Conv1D}(G_a) \tag{2}$$
We utilize the local features of the appearance modality to detect task-independent redundancy within the appearance features. To generate the local perception descriptor $P_l \in \mathbb{R}^{T}$, where $T$ denotes the number of snippets, the appearance representations $F_a$ are fed into a convolutional layer, formulated as follows:

$$P_l = \mathrm{Conv1D}(F_a) \tag{3}$$
The channel descriptor $M$ is obtained by multiplying the global perceptual descriptor $P_g$ with the local perception descriptor $P_l$. The Sigmoid function is adopted to generate optimized weights for the appearance features, and the redundancy is filtered out through the attention mechanism, formulated as follows:

$$M = P_g \otimes P_l \tag{4}$$

$$\hat{F}_a = \sigma(M) \otimes F_a \tag{5}$$

where $\otimes$ represents the (broadcast) element-wise multiplication operator, $\sigma(\cdot)$ denotes the Sigmoid function, and $\hat{F}_a$ is the de-redundant appearance feature.
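A minimal PyTorch sketch of how Eqs. (1)–(5) could be realized is given below. The 3×1 kernel and the ReLU after each convolution follow the implementation details, but the layer widths and exact wiring are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AttentionDeRedundancy(nn.Module):
    """Sketch of the AD module: a global descriptor (one value per channel, Eqs. (1)-(2)) and a
    local descriptor (one value per snippet, Eq. (3)) are combined into a channel descriptor
    (Eq. (4)) whose sigmoid re-weights the input features (Eq. (5))."""
    def __init__(self, dim=1024, kernel=3):
        super().__init__()
        self.global_conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.local_conv = nn.Conv1d(dim, 1, kernel, padding=kernel // 2)
        self.act = nn.ReLU()   # each convolution is followed by a ReLU (implementation details)

    def forward(self, x):
        # x: (B, D, T) appearance or motion features from the pre-trained I3D network
        g = self.act(self.global_conv(x.mean(dim=-1, keepdim=True)))  # (B, D, 1) global descriptor
        l = self.act(self.local_conv(x))                              # (B, 1, T) local descriptor
        m = g * l                                                     # (B, D, T) channel descriptor M
        return torch.sigmoid(m) * x                                   # attention re-weighting
```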
Multi-scale feature learning module
The employed datasets include video scenes captured under low-light conditions, resulting in video blur. Additionally, these datasets contain scenes with visual occlusions. For video snippets that are blurred or occluded, the inability to extract their complete features hinders the identification of abnormal snippets, leading to inferior detection performance. To address the challenges posed by video blur and visual occlusion, we propose the Multi-scale Feature Learning (MFL) module, as shown in Fig. 3b. Unlike the multi-scale feature extraction methods proposed by Guo et al.40,41, Yan et al.42, Zhang et al.43, and Zhou et al.44,45, this module leverages a dilated convolutional network and a self-attention mechanism to capture short-term and long-term dependencies among clips. By providing global and local information guidance for snippets with blurred or occluded regions, this module effectively mitigates the effects of video blur and visual occlusion. For clarity, we use the appearance modality as the illustration; the identical procedure applies to the motion modality.
The MFL module employs dilated convolutions with dilation rates of 1, 2, and 4 respectively to capture multi-scale temporal dependencies among video clips. Consequently, the MFL module provides local information guidance for snippets with blurred or occluded regions.
To begin with, the optimized appearance features $\hat{F}_a$ are fed into a convolutional layer to compress the number of feature channels, thereby reducing the computational cost. The specific process is formulated as follows:

$$F_a^{c} = \mathrm{Conv1D}(\hat{F}_a) \tag{6}$$

where $F_a^{c}$ denotes the compressed appearance features.
Next, to fuse multi-scale local information, the compressed appearance features $F_a^{c}$ are passed through a dilated convolutional layer with a dilation rate of 1. The output of this layer is then concatenated with its input $F_a^{c}$. The specific processes are formulated as follows:

$$F^{d1} = \mathrm{DConv}_1(F_a^{c}) \tag{7}$$

$$F^{cat}_1 = \mathrm{Concat}(F_a^{c}, F^{d1}) \tag{8}$$

where $\mathrm{DConv}_1(\cdot)$ denotes the dilated convolutional layer with a dilation rate of 1 and $F^{d1}$ is its output.
Similarly, the concatenated result $F^{cat}_1$ is fed into another dilated convolutional layer $\mathrm{DConv}_2(\cdot)$ with a dilation rate of 2. The output of this layer is concatenated with its input, and the concatenated result is fed into a dilated convolutional layer $\mathrm{DConv}_4(\cdot)$ with a dilation rate of 4. Finally, to capture multi-scale temporal dependencies, the compressed appearance features are concatenated with the outputs of the dilated convolutional layers of all scales. The specific process is as follows:

$$F^{d2} = \mathrm{DConv}_2(F^{cat}_1), \quad F^{d4} = \mathrm{DConv}_4(\mathrm{Concat}(F^{cat}_1, F^{d2})), \quad F_{ms} = \mathrm{Concat}(F_a^{c}, F^{d1}, F^{d2}, F^{d4}) \tag{9}$$

where $F_{ms}$ represents the multi-scale appearance features.
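The local branch of the MFL module can be sketched as the dilated cascade below. The dilation rates (1, 2, 4), the 3×1 kernels and the LeakyReLU activations follow the text and the implementation details; the channel widths and the exact concatenation wiring are assumptions chosen so that the output is 2048-D.

```python
import torch
import torch.nn as nn

class DilatedPyramid(nn.Module):
    """Sketch of the local (multi-scale) branch of the MFL module, Eqs. (6)-(9):
    channel compression followed by dilated temporal convolutions with rates 1, 2 and 4."""
    def __init__(self, in_dim=1024, mid_dim=512):
        super().__init__()
        self.compress = nn.Conv1d(in_dim, mid_dim, 3, padding=1)              # Eq. (6)
        self.d1 = nn.Conv1d(mid_dim, mid_dim, 3, padding=1, dilation=1)       # Eq. (7)
        self.d2 = nn.Conv1d(2 * mid_dim, mid_dim, 3, padding=2, dilation=2)
        self.d4 = nn.Conv1d(3 * mid_dim, mid_dim, 3, padding=4, dilation=4)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # x: (B, D, T) de-redundant appearance or motion features
        c = self.act(self.compress(x))                          # compressed features
        y1 = self.act(self.d1(c))                               # dilation rate 1
        y2 = self.act(self.d2(torch.cat([c, y1], dim=1)))       # Eq. (8) concat, then rate 2
        y4 = self.act(self.d4(torch.cat([c, y1, y2], dim=1)))   # rate 4
        return torch.cat([c, y1, y2, y4], dim=1)                # Eq. (9): (B, 4*mid_dim, T)
```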
The MFL module employs a self-attention mechanism to generate a correlation matrix among video clips, thereby capturing global temporal dependencies. Consequently, the MFL module provides global information guidance for video snippets with blurry or occluded regions.
First, the multi-scale features $F_{ms}$ are passed through three separate convolutional layers to acquire three feature matrices, namely $Q$, $K$, and $V$:

$$Q = \mathrm{Conv1D}(F_{ms}) \tag{10}$$

$$K = \mathrm{Conv1D}(F_{ms}) \tag{11}$$

$$V = \mathrm{Conv1D}(F_{ms}) \tag{12}$$
$Q$, $K$, and $V$ can be treated as the “Query”, “Key” and “Value” of the self-attention module. The attention scores are computed by measuring the similarity between Query vectors and Key vectors, determining the relevance of each snippet to the others in the sequence. The attention weights are then used to perform a weighted sum of the Value vectors, creating a global context representation for each video. Specifically, the correlation matrix among different snippets is obtained by multiplying the feature matrices $Q$ and $K$:

$$A = \mathrm{Softmax}\!\left(Q K^{\top}\right) \tag{13}$$

where $A$ is the correlation matrix.
Finally, the correlation matrix $A$ is multiplied with the feature matrix $V$ to obtain the global representations, which are then passed through a convolutional layer. The multi-scale features $F_{ms}$ and the resulting global features are summed element-wise, so that global context is incorporated into the multi-scale appearance features:

$$F_a^{out} = F_{ms} + \mathrm{Conv1D}(A V) \tag{14}$$

where $F_a^{out}$ represents the output feature obtained through the residual connection.
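The global branch of the MFL module is a standard self-attention block over the temporal axis; a sketch following Eqs. (10)–(14) is given below. The 1×1 Conv1D projections follow the implementation details, while the reduced attention dimension is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalTemporalAttention(nn.Module):
    """Sketch of the global branch of the MFL module, Eqs. (10)-(14): query/key/value
    projections, a snippet-to-snippet correlation matrix, and a residual connection."""
    def __init__(self, dim=2048, attn_dim=512):
        super().__init__()
        self.q = nn.Conv1d(dim, attn_dim, 1)   # Eq. (10)
        self.k = nn.Conv1d(dim, attn_dim, 1)   # Eq. (11)
        self.v = nn.Conv1d(dim, attn_dim, 1)   # Eq. (12)
        self.out = nn.Conv1d(attn_dim, dim, 1)

    def forward(self, x):
        # x: (B, D, T) multi-scale features from the dilated pyramid
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)   # (B, T, T) correlation matrix, Eq. (13)
        ctx = v @ attn.transpose(1, 2)                    # weighted sum of values per snippet
        return x + self.out(ctx)                          # residual fusion, Eq. (14)
```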
Adaptive feature fusion module
For various types of anomalies, the significance of the appearance and motion modalities may vary. To effectively leverage the features from different modalities, we propose the Adaptive Feature Fusion (AFF) module, as shown in Fig. 3c. Unlike the feature fusion methods CTL-AFFS46 proposed by Shi et al. and MSFF-Net47 and CFANet48 proposed by Zhang et al., this module utilizes global information to calculate the importance weights of the features from different modalities. It assigns higher weights to the features of the more important modality and lower weights to the features of the less important modality, enabling adaptive fusion of the diverse modalities.
Firstly, the appearance features $F_a^{out}$ and motion features $F_m^{out}$ are passed through global average pooling separately. Subsequently, 1D convolutions are utilized to obtain the global information vectors of the appearance and motion features, namely $g_a$ and $g_m$:

$$g_a = \mathrm{Conv1D}(\mathrm{GAP}(F_a^{out})) \tag{15}$$

$$g_m = \mathrm{Conv1D}(\mathrm{GAP}(F_m^{out})) \tag{16}$$
Next, the appearance vector $g_a$ and motion vector $g_m$ are stacked, and the Softmax function is applied to obtain the adaptive weight vectors $w_a$ and $w_m$ corresponding to the appearance and motion features, respectively:

$$w_a = \frac{e^{g_a}}{e^{g_a} + e^{g_m}} \tag{17}$$

$$w_m = \frac{e^{g_m}}{e^{g_a} + e^{g_m}} \tag{18}$$
The feature weight vectors are multiplied element-wise with the features of the corresponding modality, and the weighted appearance and motion features are then summed to obtain the fused feature:

$$F_{fus} = w_a \otimes F_a^{out} + w_m \otimes F_m^{out} \tag{19}$$
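A possible PyTorch realization of Eqs. (15)–(19) is sketched below; the per-channel weighting and the 3×1 convolution kernel are assumptions consistent with the implementation details rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Sketch of the AFF module, Eqs. (15)-(19): pooled modality descriptors are turned into
    softmax fusion weights, and the weighted appearance and motion features are summed."""
    def __init__(self, dim=2048, kernel=3):
        super().__init__()
        self.conv_app = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)  # Eq. (15)
        self.conv_mot = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)  # Eq. (16)

    def forward(self, f_app, f_mot):
        # f_app, f_mot: (B, D, T) appearance / motion features from the MFL module
        g_app = self.conv_app(f_app.mean(dim=-1, keepdim=True))   # (B, D, 1) global appearance vector
        g_mot = self.conv_mot(f_mot.mean(dim=-1, keepdim=True))   # (B, D, 1) global motion vector
        w = torch.softmax(torch.stack([g_app, g_mot]), dim=0)     # Eqs. (17)-(18): adaptive weights
        return w[0] * f_app + w[1] * f_mot                        # Eq. (19): weighted fusion
```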
MIL ranking loss
To enlarge the gap between the anomaly scores of anomalous snippets and those of normal snippets, the Multiple Instance Learning (MIL) ranking loss is employed to train the model16. This loss is formulated below:

$$l(B_a, B_n) = \max\!\left(0,\; 1 - \max_{i \in B_a} f(v_a^i) + \max_{i \in B_n} f(v_n^i)\right) \tag{20}$$

where $B_a$ and $B_n$ represent the bags of fused snippet features of the abnormal and normal video, $f(v_a^i)$ and $f(v_n^i)$ denote the anomaly scores of the $i$-th snippet in the abnormal and normal video, and $\max_{i} f(v_a^i)$ and $\max_{i} f(v_n^i)$ are the highest snippet scores in the abnormal and normal videos, respectively.
Given that videos are composed of sequential segments, the anomaly score should vary smoothly between adjacent segments. Furthermore, as abnormal events are infrequent and transient in real-world situations, the abnormal scores of segments within abnormal videos are expected to be highly sparse. Therefore, a temporal smoothness term and a sparsity term are introduced:

$$l_{smooth} = \sum_{i=1}^{T-1} \left(f(v_a^i) - f(v_a^{i+1})\right)^2 \tag{21}$$

$$l_{sparse} = \sum_{i=1}^{T} f(v_a^i) \tag{22}$$
To mitigate overfitting during network training, we incorporate a regularization term on the model weights $\mathbf{W}$. The ultimate formulation of the loss function is presented below:

$$L = l(B_a, B_n) + \lambda_1\, l_{smooth} + \lambda_2\, l_{sparse} + \lambda_3 \lVert \mathbf{W} \rVert_F \tag{23}$$

where $\lambda_1$ denotes the weight of the smoothness term, $\lambda_2$ denotes the weight of the sparsity term, and $\lambda_3$ controls the regularization term.
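The training objective of Eqs. (20)–(23) can be written compactly as below. The snippet scores are assumed to come from the fully connected scoring layers; the default values of λ1 and λ2 are purely illustrative (the paper sets both terms to the same weight), and the weight-regularization term is typically realized through the optimizer's weight decay, so it is omitted from the function.

```python
import torch

def mil_ranking_loss(s_abn, s_nor, lam1=8e-5, lam2=8e-5):
    """Sketch of the MIL ranking objective, Eqs. (20)-(23): a hinge between the top-scoring
    abnormal and normal snippets plus smoothness and sparsity terms on the abnormal scores.
    lam1 / lam2 defaults are illustrative, not the paper's reported values."""
    # s_abn, s_nor: (B, T) snippet-level anomaly scores of abnormal / normal videos in [0, 1]
    hinge = torch.relu(1.0 - s_abn.max(dim=1).values + s_nor.max(dim=1).values)  # Eq. (20)
    smooth = ((s_abn[:, 1:] - s_abn[:, :-1]) ** 2).sum(dim=1)                    # Eq. (21)
    sparse = s_abn.sum(dim=1)                                                    # Eq. (22)
    # The weight-regularization term of Eq. (23) is handled via the optimizer's weight decay.
    return (hinge + lam1 * smooth + lam2 * sparse).mean()                        # Eq. (23)
```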
Experiments
Dataset and evaluation metric
Dataset. To verify the efficacy of the proposed method, we perform comprehensive experiments on two benchmark datasets, i.e., ShanghaiTech and UCF-Crime. The ShanghaiTech dataset, collected from fixed-angle street surveillance, is of medium size. It comprises 437 videos, consisting of 307 labeled as normal and 130 labeled as abnormal. The original dataset4 was designed for unsupervised video anomaly detection, where the training set exclusively comprises normal videos. To adapt this dataset for weakly supervised algorithms, Zhong et al.18 restructured it by moving a portion of the abnormal test videos into the training data. As a result, both the training and test sets of the ShanghaiTech dataset contain normal and anomalous videos. We adopt the same protocol to make the original ShanghaiTech dataset suitable for weakly supervised algorithms. The UCF-Crime dataset, a large-scale anomaly detection dataset, includes 1900 untrimmed videos totaling 128 hours collected from real-world indoor and street surveillance. Unlike the static scenes of the ShanghaiTech dataset, the UCF-Crime dataset contains intricate and varied scenes. It covers 13 types of anomalies, including road accidents, vandalism, shoplifting, and others. The training set of UCF-Crime comprises 810 abnormal videos and 800 normal videos, all annotated with video-level labels. The test set consists of 140 abnormal videos and 150 normal videos, all annotated with frame-level labels.
Evaluation metric. The evaluation metric used for all datasets is the frame-level area under the ROC curve (AUC). AUC evaluates the classifier's ability to distinguish positive from negative instances and provides a more reasonable assessment when positive and negative samples are imbalanced. The higher the AUC value, the more effective the classifier.
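As a reference, frame-level AUC is usually computed by expanding snippet-level scores to frame-level scores and comparing them against the frame-level ground truth; the short sketch below shows one such computation (the function and variable names are illustrative, not the authors' evaluation code).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(snippet_scores, frame_labels, frames_per_snippet=16):
    """Repeat each snippet score over its frames and compute the ROC AUC against
    frame-level labels (0 = normal, 1 = abnormal)."""
    frame_scores = np.repeat(np.asarray(snippet_scores), frames_per_snippet)[: len(frame_labels)]
    return roc_auc_score(frame_labels, frame_scores)
```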
Implementation details
We implement and train the proposed method on a single RTX 2080 Ti GPU using PyTorch. Each video frame is resized to a fixed resolution, and the optical flow maps are generated using the TV-L1 algorithm49. To extract both RGB and optical flow features, we employ the I3D network pre-trained on the Kinetics dataset. For every 16-frame video clip, we compute I3D features and then apply normalization. To obtain the features of a video segment, we average all 16-frame clip features within that segment. The 1024-D motion and appearance representations extracted from the pre-trained I3D network are input into the AD module. In the AD module, we use a 3×1 Conv1D for each convolutional layer, followed by a ReLU activation function. In the MFL module, we use a 3×1 Conv1D for each dilated convolutional layer, followed by a LeakyReLU activation function. For the self-attention block, we use 1×1 Conv1D layers. The 2048-D motion and appearance representations produced by the MFL module are input into the AFF module. In the AFF module, we use an AdaptiveAvgPool1d with output size 1 for the pooling layer and a 3×1 Conv1D for each convolutional layer. The 2048-D fused representations obtained from the AFF module are input into the fully connected layers. The three fully connected layers contain 512, 128, and 1 nodes, respectively, and a ReLU activation function follows each fully connected layer. The weights of the smoothness and sparsity terms in the MIL ranking loss, λ1 and λ2, are set to the same value. The Adam optimizer is employed to train the proposed method, with the learning rate fixed at 0.001 and a weight decay of 0.0005. During training, a batch size of 16 is utilized, and the training process runs for 300 iterations.
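The segment-level feature preparation described above can be sketched as follows; the number of segments per video is an assumption (the paper only states that each video is split into N segments), and the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def clip_to_segment_features(clip_feats, num_segments=32):
    """Average L2-normalized 16-frame clip features into segment-level features.

    clip_feats: (num_clips, D) I3D features of consecutive 16-frame clips of one video.
    Returns approximately (num_segments, D); assumes num_clips >= num_segments."""
    clip_feats = F.normalize(clip_feats, dim=-1)            # per-clip L2 normalization
    chunks = torch.chunk(clip_feats, num_segments, dim=0)   # group consecutive clips per segment
    return torch.stack([c.mean(dim=0) for c in chunks])     # average clip features per segment
```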
Performance comparison with competing approaches
To assess the performance of the proposed approach, we conduct comparisons with competing approaches on ShanghaiTech and UCF-Crime. These approaches include Conv-AE2, Stacked-RNN6, GCL50, GCN18, CLAWS19 and MIST22.
Performance comparison with competing approaches on ShanghaiTech
We conduct comparative experiments between the proposed method and competing algorithms on ShanghaiTech, as presented in Table 1.
Table 1.
Performance comparison with competing methods on ShanghaiTech.
| Supervision | Method | AUC(%) |
|---|---|---|
| Unsupervised | Conv-AE2 | 60.85 |
| | Stacked-RNN6 | 68.00 |
| | MNAD8 | 70.50 |
| | Mem-AE3 | 71.20 |
| | Frame-Pred4 | 73.40 |
| | Deng et al.51 | 75.00 |
| | GCL50 | 78.93 |
| | FRD52 | 80.14 |
| Weakly supervised | Zhang et al.20 | 82.50 |
| | GCN18 | 84.44 |
| | Sultani et al.16 | 85.33 |
| | CLAWS19 | 89.67 |
| | AR-Net11 | 91.24 |
| | CLAWS Net+53 | 91.46 |
| | MSAF15 | 93.46 |
| | MIST22 | 94.83 |
| | VPE54 | 96.88 |
| | Ours | **97.00** |
The best scores are highlighted in bold.
Hasan et al.2 used a Convolutional Auto-Encoder (Conv-AE) to learn regular characteristics in long-term videos and performed anomaly detection based on reconstruction error: lower reconstruction errors indicate normal behaviors, while higher reconstruction errors indicate abnormal behaviors. This method achieved an AUC score of 60.85%. Luo et al.6 developed Temporally-coherent Sparse Coding (TSC) to detect abnormal events. This method mapped TSC to a Stacked Recurrent Neural Network (stacked-RNN) and then performed anomaly detection based on the prediction error, achieving an AUC score of 68.00%. Liu et al.4 detected abnormal events by comparing predicted frames with the ground truth and achieved an AUC score of 73.40%. Gong et al.3 designed the Mem-AE to detect abnormal events; the Mem-AE reconstructs each input from the most similar items retrieved from a Memory module. This method achieved an AUC score of 71.20%. Park et al.8 introduced a Memory module between the encoder and decoder to enhance the model's ability to remember normal information, achieving an AUC score of 70.50%. Zaheer et al.50 developed Generative Cooperative Learning (GCL) for unsupervised video anomaly detection, in which the generator and discriminator are cross-supervised in a cooperative manner: the pseudo-labels generated by the generator are utilized to compute the loss of the discriminator, while the pseudo-labels generated by the discriminator are utilized to compute the loss of the generator. This method achieved an AUC score of 78.93%. The above approaches are unsupervised. Nonetheless, they are vulnerable to mistakenly flagging new, unlearned normal events, leading to decreased anomaly detection accuracy. Moreover, by concentrating exclusively on learning the features of normal events, these methods overlook effective optimization for abnormal event detection. Consequently, their accuracy is low, with the highest achieved AUC value being 78.93%. In contrast, our proposed method achieves a significantly higher AUC value, surpassing this result by 18.07 percentage points.
Zhang et al.20, GCN18, Sultani et al.16, and CLAWS19 employ weakly supervised methods that jointly train the network using both video-level labeled abnormal and normal videos. These approaches can acquire more discriminative representations of abnormal and normal events, thus improving detection performance. Zhong et al.18 proposed GCN to detect abnormal events, achieving an AUC score of 84.44%. Zhang et al.20 proposed a novel intrinsic loss to effectively constrain the function space of the anomaly detection problem, achieving an AUC score of 82.50%. Sultani et al.16 detected abnormal events by multiple instance learning, obtaining an AUC score of 85.33%. Zaheer et al.19 proposed a normalcy suppression mechanism to minimize the abnormal scores of normal instances and designed a clustering-distance loss to reduce label noise, achieving an AUC score of 89.67%. Feng et al.22 proposed the MIST framework to detect abnormal events, achieving an impressive AUC score of 94.83%.
However, the above methods utilize unimodal RGB data for anomaly detection. Unimodal data lacks sufficient information, and relying on a single modality can result in suboptimal detection performance. To overcome this limitation, Wan et al.11 and Wei et al.15 fused information from both the RGB and optical flow modalities. These modalities complement each other and effectively mitigate the limitations caused by insufficient information within a single modality, thus improving detection performance. Wan et al.11 proposed the AR-Net framework to detect abnormal events, achieving an AUC score of 91.24%. Wei et al.15 developed an anomaly detection framework called MSAF, yielding an AUC score of 93.46%. Despite these advancements, several challenges remain, including redundancy in the extracted optical flow and RGB representations, the neglect of video blur and visual occlusion, and the equal treatment of RGB and optical flow features. Addressing these challenges can further improve detection performance. We extract appearance representations from RGB clips and motion representations from optical flow clips, respectively. To address the issue of task-independent redundant information in these features, we introduce the AD module. Additionally, we propose the MFL module to mitigate challenges such as video blur and visual occlusion. Moreover, we propose the AFF module to adaptively fuse the representations from diverse modalities. Our method is superior to mainstream approaches, achieving the best result with an impressive AUC of 97.00%.
Performance comparison with competing approaches on UCF-Crime
We further conduct comparative experiments between the proposed method and mainstream algorithms on UCF-Crime, as presented in Table 2.
Table 2.
Performance comparison with competing methods on UCF-Crime.
| Supervision | Method | AUC(%) |
|---|---|---|
| Unsupervised | Conv-AE2 | 50.60 |
| | Sohrab et al.55 | 58.50 |
| | Lu et al.56 | 65.51 |
| | GODS57 | 70.46 |
| | GCL50 | 71.04 |
| | FRD52 | 77.56 |
| | DyAnNet58 | 79.76 |
| | C2FPL59 | 80.65 |
| Weakly supervised | Sultani et al.16 | 77.92 |
| | Zhang et al.20 | 78.66 |
| | GCN18 | 82.12 |
| | MIST22 | 82.30 |
| | CLAWS19 | 83.03 |
| | Ullah et al.60 | 84.00 |
| | CLAWS Net+53 | 84.16 |
| | RTFM17 | 84.30 |
| | Thakare et al.61 | 84.48 |
| | Peng et al.21 | 84.89 |
| | Ours | **85.31** |
The best scores are highlighted in bold.
Hasan et al.2 proposed the Conv-AE algorithm to detect abnormal events, achieving an AUC score of 50.60%. Sohrab et al.55 proposed the subspace support vector data description to detect abnormal events, achieving an AUC score of 58.50%. Lu et al.56 detected abnormal events by sparse combination learning, yielding an AUC score of 65.51%. Wang et al.57 proposed the Generalized One-class Discriminative Subspaces (GODS). This method flexibly constrains a class of data distribution by learning a pair of complementary classifiers, achieving an AUC score of 70.46%. The GCL algorithm, proposed by Zaheer et al.50, achieved an AUC score of 71.04%.
The aforementioned methods employ unsupervised anomaly detection, resulting in low AUC scores. The GCN algorithm proposed by Zhong et al.18 obtained an AUC score of 82.12%. The approach devised by Zhang et al.20 achieved an AUC score of 78.66%. The approach designed by Sultani et al.16 achieved an AUC score of 77.92%. The MIST framework devised by Feng et al.22 achieved an AUC score of 82.30%. The approach proposed by Thakare et al.61 achieved an impressive AUC score of 84.48%. Tian et al.17 devised robust temporal feature magnitude learning (RTFM) to detect abnormal events, achieving an AUC score of 84.30%. Peng et al.21 devised an anomaly detection method consisting of four modules: causal temporal relation, classifier, compactness, and dispersion, achieving an AUC score of 84.89%. Zaheer et al.19 proposed clustering-assisted weakly supervised (CLAWS) learning, which designs a normalcy suppression mechanism to minimize the abnormal scores of normal snippets and a clustering-distance-based loss to mitigate label noise, achieving an AUC score of 83.03%.
The aforementioned weakly supervised methods enhance detection performance in contrast to unsupervised methods. However, these methods rely solely on unimodal RGB data for anomaly detection, leaving room for further improvement in detection performance. In contrast, our proposed method outperforms mainstream methods, attaining superior results with an impressive AUC score of 85.31%.
Sample performance evaluation
To evaluate the sample performance of our proposed approach, we train the model with varying numbers of abnormal videos on ShanghaiTech. Specifically, we decrease the number of anomalous training videos from 63 to 25, while keeping the test set and the number of normal training videos unchanged. We compare our proposed method with Sultani et al.16 and present the results in Fig. 5.
Fig. 5.
The detection results in different numbers of abnormal videos.
Figure 5 illustrates that as the number of anomalous training videos decreases, the AUC scores of both Sultani et al.16 and our proposed method decrease. The method of Sultani et al.16, which uses all abnormal training videos, achieves an AUC score of 87.03%, whereas our proposed method, using only 25 abnormal training videos, achieves an AUC score of 87.46%. Despite using fewer abnormal training videos, the proposed method effectively utilizes the training data and attains superior detection performance. This is due to our method's superior recognition of positive instances in abnormal videos, allowing it to utilize the same training data more effectively than the approach proposed by Sultani et al.
Subtle anomaly discriminability
To evaluate the validity of the proposed method in detecting each anomaly class, the AUC score is calculated for individual anomaly class on UCF-Crime. Additionally, we compare our proposed method with Sultani et al.16, presenting the results in Fig. 6.
Fig. 6.
The detection results in each individual anomaly class.
Figure 6 illustrates that the proposed method outperforms Sultani et al. in detecting 11 classes of anomalies, namely abuse, arson, assault, road accidents, explosion, fighting, robbery, shooting, stealing, vandalism, and shoplifting. However, our proposed method underperforms Sultani et al. in detecting two types of anomalies, namely arrest and burglary. The proposed method significantly improves the accuracy of detecting multiple types of anomalies in complex scenarios by integrating cross-modal feature fusion, multi-scale feature learning, and attention de-redundancy; this approach effectively captures both long-term and short-term temporal dependencies while fully leveraging complementary information from multi-modal data. Overall, the proposed method exhibits strong performance across individual anomaly classes.
Ablation studies
Effectiveness of proposed modules. To evaluate the contribution of each module in our method, we conduct ablation experiments on the two anomaly benchmarks, i.e., UCF-Crime and ShanghaiTech datasets. We design some comparison experiments. (1) The baseline is the approach devised by Sultani et al.16. This approach adopts the I3D network to capture representations from RGB snippets and subsequently calculates anomaly scores by passing these features through three fully connected layers. (2) We incorporate the optical flow snippets in addition to the RGB snippets. (3) Building upon experiment (2), we add the AFF, AD or MFL module. (4) Building upon experiment (2), we add the AFF and AD modules. (5) Building upon experiment (2), we add the AFF, AD and MFL modules. The results of these five comparison experiments are presented in Table 3.
Table 3.
Contribution of each module in our method.
| Number | Baseline(RGB) | Optical flow | AFF | AD | MFL | AUC(%) ShanghaiTech | AUC(%) UCF |
|---|---|---|---|---|---|---|---|
| 1 | ✓ | | | | | 85.33 | 77.92 |
| 2 | ✓ | ✓ | | | | 88.10 | 81.10 |
| 3 | ✓ | ✓ | ✓ | | | 91.49 | 82.56 |
| 4 | ✓ | ✓ | | ✓ | | 90.51 | 81.69 |
| 5 | ✓ | ✓ | | | ✓ | 92.29 | 83.05 |
| 6 | ✓ | ✓ | ✓ | ✓ | | 93.88 | 83.73 |
| 7 | ✓ | ✓ | ✓ | ✓ | ✓ | **97.00** | **85.31** |
The best scores are highlighted in bold.
As shown in Table 3, when the baseline incorporates optical flow snippets in addition to the RGB snippets, there is a notable increase in the AUC score. Specifically, the AUC scores increase by 2.77% and 3.18% on ShanghaiTech and UCF-Crime, respectively. This set of experiments validates the effectiveness of combining RGB snippets and optical flow snippets. These two modalities complement each other and effectively mitigate the limitations presented by insufficient information within a single modality, consequently enhancing detection effectiveness. Based on Experiment (2), the addition of AFF, AD or MFL module yields further enhancement in the AUC score. Specifically, the AUC scores increase by 3.39%, 2.41%, 4.19% on ShanghaiTech, as well as 1.46%, 0.59%, 1.95% on UCF. These experiments validate the effectiveness of the proposed AFF, AD and MFL modules. Additionally, combining AD with AFF yields further enhancement in the AUC score. Specifically, the AUC scores increase by 2.39% and 1.17%, respectively. This set of experiments validates the efficacy of the proposed AD module. This module can filter out task-independent redundancy in both optical flow and RGB representations. Based on this, the addition of the MFL module yields further enhancement in the AUC score. Specifically, the AUC scores increase by 3.12% and 1.58%, respectively. This set of experiments validates the efficacy of the proposed MFL module. This module can obtain long-term and short-term dependencies among segments to effectively mitigate issues of video blur and visual occlusion. The above experimental results demonstrate that the proposed modules effectively enhance detection performance.
Effectiveness of dual branches. Table 4 illustrates the impact of applying the AD and MFL modules to the branches with RGB and optical flow inputs. We observe the following: 1) Adding the AD module to either the RGB or the optical flow branch yields similar results, with AUC scores of 89.76% and 89.69% on ShanghaiTech and 81.32% and 81.29% on UCF-Crime, respectively. Likewise, adding the MFL module to a single branch shows similar behavior (91.14% and 91.23% on ShanghaiTech, 81.89% and 81.95% on UCF-Crime). 2) When the AD or MFL module is added to both branches simultaneously, the modules achieve their best performance. This demonstrates that both branches play irreplaceable roles, and adding the AD or MFL module to both branches simultaneously further boosts performance.
Table 4.
Effectiveness of dual branches.
| Number | AD | MFL | RGB | Optical flow | AUC(%) ShanghaiTech | AUC(%) UCF |
|---|---|---|---|---|---|---|
| 1 | ✓ | | ✓ | | 89.76 | 81.32 |
| 2 | ✓ | | | ✓ | 89.69 | 81.29 |
| 3 | ✓ | | ✓ | ✓ | 90.51 | 81.69 |
| 4 | | ✓ | ✓ | | 91.14 | 81.89 |
| 5 | | ✓ | | ✓ | 91.23 | 81.95 |
| 6 | | ✓ | ✓ | ✓ | **92.29** | **83.05** |
The best scores are highlighted in bold.
Compare with other redundancy reduction methods. To demonstrate that our proposed AD module is more effective for weakly supervised video anomaly detection than other redundancy reduction methods, we conduct a comparative analysis with BSP62, SGA63, and STGA64, as shown in Table 5. BSP primarily reduces redundant information by utilizing knowledge transferred from a teacher model. SGA enhances the learning of task-specific representations by optimizing attention map generation through pseudo-label supervision. STGA, which employs a spatial correlation graph and a temporal dependence graph to eliminate redundancy, improves over the earlier methods; however, our method still outperforms STGA. The results presented in Table 5 confirm that our AD module achieves the best performance among the compared redundancy reduction methods.
Table 5.
Compare with other redundancy reduction methods.
| Method | BSP62 | SGA63 | STGA64 | AD |
|---|---|---|---|---|
| AUC(%) ShanghaiTech | 87.23 | 89.06 | 89.68 | 90.51 |
| AUC(%) UCF | 80.23 | 80.90 | 81.02 | 81.69 |
The best scores are highlighted in bold.
Hyperparameter analysis. We perform experiments on the ShanghaiTech dataset to analyze the impact of various parameters on the overall performance of our model. Our model includes three key parameters: 1) the number of fully connected layers L; 2) the sample number k for Top-k selection; and 3) the loss weights λ1 and λ2. Specifically, L controls the depth of the fully connected layers, while k determines the number of samples in a bag used to train the proposed model. The weights λ1 and λ2 balance the smoothness and sparsity terms. As illustrated in Fig. 7a, our method achieves the optimal detection performance on the ShanghaiTech dataset when L is set to 3. Similarly, Fig. 7b shows that our method obtains the best performance on ShanghaiTech when k = 3. Moreover, as depicted in Fig. 7c and d, our method achieves the optimal detection performance on ShanghaiTech when λ1 and λ2 are both set to the value adopted in our implementation.
Fig. 7.
Influences of the hyperparameters. a Layer number L, b Sample number k, c Weight λ1 of the smoothness term, d Weight λ2 of the sparsity term.
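The paper does not spell out exactly how the Top-k selection enters training; a common realization in MIL-based detectors, shown below purely as an assumption, replaces the single maximum of Eq. (20) with the mean of the k highest snippet scores in a bag (k = 3 is the best-performing value in Fig. 7b).

```python
import torch

def topk_video_score(snippet_scores, k=3):
    """Video-level score as the mean of the k highest snippet scores in each bag.

    snippet_scores: (B, T) per-snippet anomaly scores; returns a (B,) tensor that could
    replace the max() terms of the MIL ranking loss."""
    return snippet_scores.topk(k, dim=1).values.mean(dim=1)
```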
Generalizability analysis. To validate the generalization capability of the proposed approach, we perform a domain generalization analysis on the ShanghaiTech and UCF-Crime datasets. Specifically, we train the model on the source domain and assess its performance on the target domain, using ShanghaiTech and UCF-Crime in turn as the source and target domains. The performance comparisons under these cross-domain settings are presented in Table 6, from which we draw the following conclusions. 1) The effectiveness of current methods drops significantly in the cross-domain settings. 2) Our method has strong cross-domain generalization capability. In the ShanghaiTech->UCF-Crime and UCF-Crime->ShanghaiTech cross-domain settings, Sultani et al. achieves 33.30% and 42.49% AUC, respectively, while our method outperforms Sultani et al. by 7.44% and 6.14% AUC. The gain of our method in the cross-domain setting demonstrates that cross-modal feature fusion, multi-scale feature learning, and attention de-redundancy enhance video anomaly detection on unseen domains.
Table 6.
Experimental results on domain generalization task.
| Method | ShanghaiTech->UCF AUC(%) | UCF->ShanghaiTech AUC(%) |
|---|---|---|
| Sultani et al.16 | 33.30 | 42.49 |
| Ours | **40.74** | **48.63** |
The best scores are highlighted in bold.
Complexity analysis
The entire model runs at 44 FPS at the input frame resolution used in our experiments. Specifically, the MFL module, which extracts local and global features, runs at 45 FPS. After feature extraction, predicting anomaly scores takes just 1.81 ms, equivalent to about 550 FPS. Overall, our model is suitable for online applications.
Qualitative results
To assess the performance of the proposed approach, we qualitatively compare it with three baselines on ShanghaiTech and UCF-Crime. The qualitative results are shown in Figs. 8 and 9.
Fig. 8.
Qualitative performance of our method and another three baselines on ShanghaiTech. a Normal, b Skateboard, c Bicycle, d Run, e Car, f Motorcycle. "GT" denotes the frame-level ground truth. "w/o MFL" indicates a baseline without the Multi-scale Feature Learning module; "w/o MFL & AD" indicates a baseline without the Multi-scale Feature Learning and Attention De-redundancy modules; and "w/o MFL & AD & AFF" indicates a baseline without the Multi-scale Feature Learning, Attention De-redundancy, and Adaptive Feature Fusion modules, i.e., only the optical flow modality is added to the RGB baseline. Best viewed in high resolution.
Fig. 9.
Qualitative performance of our method and another three baselines on UCF-Crime. a Normal, b Abuse, c Arrest, d Vandalism, e Explosion, f Robbery.
Specifically, Figs. 8a and 9a show the visualization results for normal videos: our method generates low abnormal scores for normal videos, almost approaching 0. Figures 8b-f and 9b-f respectively present the qualitative results for various types of video anomalies. An effective algorithm should assign high scores to abnormal segments and low scores to normal segments. It can be seen from Figs. 8b-e and 9b-e that our method generates high abnormal scores when the abnormal events occur, while generating low abnormal scores during normal states; the proposed method accurately localizes the time periods in which the abnormal events occur. The appearance of the explosion event in Fig. 9e changes significantly. The motion of the arrest event in Fig. 9c and the running event in Fig. 8d changes significantly. Both the appearance and motion of the abnormal events in Fig. 8b, c, and e change significantly. These qualitative results demonstrate that fusing the RGB and optical flow modalities effectively compensates for the limitations of a single modality. Additionally, the proposed AFF module adaptively fuses RGB and optical flow features to effectively utilize discriminative features, enabling precise detection of various types of video anomalies. Fig. 9b shows a scenario with visual occlusion, while Fig. 9d depicts video blurring caused by low lighting conditions during nighttime capture. The qualitative performance illustrates the efficacy of the proposed MFL module: it effectively captures the long-term and short-term dependencies among clips to provide global and local information guidance for video clips containing blurred or occluded areas, thereby mitigating the challenges associated with video blurring and visual occlusion. The experimental results obtained across various scenes demonstrate the superior detection capability of the proposed method. Figures 8f and 9f illustrate failure cases caused by the lack of guidance from high-level semantic information such as text. In the future, we will focus on incorporating text information to enhance the accuracy and robustness of anomaly detection.
Conclusion
We propose a weakly supervised video anomaly detection method that fuses multimodal and multiscale features. Firstly, we introduce an AD module to reduce the task-irrelevant redundancy in the appearance and motion features extracted from a pre-trained I3D network. We then propose an MFL module, which incorporates a dilated convolutional network and a self-attention mechanism to capture long-term and short-term temporal dependencies among video clips, effectively addressing challenges associated with video blur and visual occlusion. Finally, to effectively leverage the features from different modalities, we propose an AFF module, which obtains feature weights for the different modalities to adaptively fuse appearance and motion features. Comprehensive experimental results demonstrate that our approach surpasses mainstream methods on ShanghaiTech and UCF-Crime. In the future, we will consider incorporating semantic information and active learning to improve the efficiency and accuracy of video anomaly detection.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (U20A20163, 62201066, 62001033), and Beijing Municipal Education Commission Research Program (KZ202111232049, KM202111232014).
Author contributions
L. C. contributed to conceptualization, methodology, and writing - review & editing. W. S. contributed to methodology, software, writing - original draft, and investigation. Y. G. contributed to formal analysis, investigation, and data curation. K. D. contributed to writing - review & editing and supervision.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Declarations
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Lin Cao, Email: CharLin@bistu.edu.cn.
Yanan Guo, Email: yananguo@bistu.edu.cn.
References
- 1.Doshi, K. & Yilmaz, Y. Towards interpretable video anomaly detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2655–2664 (2023). 10.1109/WACV56688.2023.00268.
- 2.Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A. K. & Davis, L. S. Learning temporal regularity in video sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 733–742 (2016). 10.1109/CVPR.2016.86.
- 3.Gong, D. et al. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1705–1714 (2019). 10.1109/ICCV.2019.00179.
- 4.Liu, W., Luo, W., Lian, D. & Gao, S. Future frame prediction for anomaly detection–a new baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6536–6545 (2018). 10.1109/CVPR.2018.00684.
- 5.Yan, C., Zhang, S., Liu, Y., Pang, G. & Wang, W. Feature prediction diffusion model for video anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5527–5537 (2023).
- 6.Luo, W., Liu, W. & Gao, S. A revisit of sparse coding based anomaly detection in stacked rnn framework. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 341–349 (2017). 10.1109/ICCV.2017.45.
- 7.Al-lahham, A., Tastan, N., Zaheer, M. Z. & Nandakumar, K. A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6793–6802 (2024).
- 8.Park, H., Noh, J. & Ham, B. Learning memory-guided normality for anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14372–14381 (2020). 10.1109/CVPR42600.2020.01438.
- 9.Yu, G. et al. Cloze test helps: Effective video anomaly detection via learning to complete video events. In: Proceedings of the 28th ACM International Conference on Multimedia, 583–591 (2020). 10.1145/3394171.3413973.
- 10.Deng, H., Zhang, Z., Zou, S. & Li, X. Bi-directional frame interpolation for unsupervised video anomaly detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2634–2643 (2023). 10.1109/WACV56688.2023.00266.
- 11.Wan, B., Fang, Y., Xia, X. & Mei, J. Weakly supervised video anomaly detection via center-guided discriminative learning. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), 1–6 (IEEE, 2020). 10.1109/ICME46284.2020.9102722.
- 12.Dubey, S., Boragule, A., Gwak, J. & Jeon, M. Anomalous event recognition in videos based on joint learning of motion and appearance with multiple ranking measures. Appl. Sci. 11, 1344. 10.3390/app11031344 (2021).
- 13.Tran, D., Bourdev, L., Fergus, R., Torresani, L. & Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 4489–4497 (2015). 10.1109/ICCV.2015.510.
- 14.Carreira, J. & Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6299–6308 (2017). 10.1109/CVPR.2017.502.
- 15.Wei, D., Liu, Y., Zhu, X., Liu, J. & Zeng, X. Msaf: Multimodal supervise-attention enhanced fusion for video anomaly detection. IEEE Signal Process. Lett. 29, 2178–2182. 10.1109/LSP.2022.3216500 (2022).
- 16.Sultani, W., Chen, C. & Shah, M. Real-world anomaly detection in surveillance videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6479–6488 (2018). 10.1109/CVPR.2018.00678.
- 17.Tian, Y. et al. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 4975–4986 (2021). 10.1109/ICCV48922.2021.00493.
- 18.Zhong, J.-X. et al. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1237–1246 (2019). 10.1109/CVPR.2019.00133.
- 19.Zaheer, M. Z., Mahmood, A., Astrid, M. & Lee, S.-I. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In: European Conference on Computer Vision, 358–376 (2020). 10.1007/978-3-030-58542-6_22.
- 20.Zhang, J., Qing, L. & Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In: 2019 IEEE International Conference on Image Processing (ICIP), 4030–4034 (2019). 10.1109/ICIP.2019.8803657.
- 21.Wu, P. & Liu, J. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Trans. Image Process. 30, 3513–3527. 10.1109/TIP.2021.3062192 (2021).
- 22.Feng, J.-C., Hong, F.-T. & Zheng, W.-S. Mist: Multiple instance self-training framework for video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14009–14018 (2021). 10.1109/CVPR46437.2021.01379.
- 23.Qi, P., Chiaro, D. & Piccialli, F. Fl-fd: Federated learning-based fall detection with multimodal data fusion. Inform. Fusion 99, 101890. 10.1016/j.inffus.2023.101890 (2023).
- 24.Hong, F.-T., Huang, X., Li, W.-H. & Zheng, W.-S. Mini-net: Multiple instance ranking network for video highlight detection. In: European Conference on Computer Vision, 345–360 (2020). 10.1007/978-3-030-58601-0_21.
- 25.Zhang, G. et al. A unified multi-task semantic communication system for multimodal data. IEEE Trans. Commun. 10.1109/TCOMM.2024.3364990 (2024).
- 26.Munro, J. & Damen, D. Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 122–132 (2020). 10.1109/ICCVW.2019.00461.
- 27.Rao, A. et al. A local-to-global approach to multi-modal movie scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10146–10155 (2020). 10.1109/CVPR42600.2020.01016.
- 28.Zhai, S. et al. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In: Proceedings of the 31st ACM International Conference on Multimedia, 1577–1587 (2023). 10.1145/3581783.3612108.
- 29.Xu, D., Ouyang, W., Wang, X. & Sebe, N. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 675–684 (2018). 10.1109/CVPR.2018.00077.
- 30.Wang, Y., Li, Y. & Cui, Z. Incomplete multimodality-diffused emotion recognition. Adv. Neural Inform. Process. Syst. (NeurIPS) 36 (2024).
- 31.Wang, Y., Cui, Z. & Li, Y. Distribution-consistent modal recovering for incomplete multimodal learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 22025–22034 (2023).
- 32.Ngiam, J. et al. Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML), 689–696 (2011).
- 33.Jiang, R., Zhang, J., Tang, Y., Feng, J. & Wang, C. Self-adaptive de algorithm without niching parameters for multi-modal optimization problems. Appl. Intell. 10.1007/s10489-021-03003-z (2022).
- 34.Ionescu, R. T., Khan, F. S., Georgescu, M.-I. & Shao, L. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7842–7851 (2019). 10.1109/CVPR.2019.00803.
- 35.Nguyen, T.-N. & Meunier, J. Anomaly detection in video sequence with appearance-motion correspondence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1273–1283 (2019). 10.1109/ICCV.2019.00136.
- 36.Li, G. et al. Multi-hierarchical category supervision for weakly-supervised temporal action localization. IEEE Trans. Image Process. 30, 9332–9344. 10.1109/TIP.2021.3124671 (2021).
- 37.Li, G. et al. Boosting weakly-supervised temporal action localization with text information. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10648–10657 (2023).
- 38.Liu, D. et al. Densernet: Weakly supervised visual localization using multi-scale feature aggregation. In: Proceedings of the AAAI Conference on Artificial Intelligence 35, 6101–6109 (2021).
- 39.Li, G., Cheng, D., Wang, N., Li, J. & Gao, X. Neighbor-guided pseudo-label generation and refinement for single-frame supervised temporal action localization. IEEE Trans. Image Process. 10.1109/TIP.2024.3378477 (2024).
- 40.Guo, X., Zhang, X., Li, L. & Xia, Z. Micro-expression spotting with multi-scale local transformer in long videos. Pattern Recogn. Lett. 168, 146–152. 10.1016/j.patrec.2023.03.012 (2023).
- 41.Guo, X., Peng, W., Huang, H. & Xia, Z. Micro-gesture online recognition with graph-convolution and multiscale transformers for long sequence. In: International Joint Conference on Artificial Intelligence (IJCAI) (2023).
- 42.Yan, L. et al. Gl-rg: Global-local representation granularity for video captioning. arXiv preprint arXiv:2205.10706 (2022).
- 43.Zhang, Y., Liu, Y. & Wu, C. Attention-guided multi-granularity fusion model for video summarization. Expert Syst. Appl. 249, 123568. 10.1016/j.eswa.2024.123568 (2024).
- 44.Zhou, X. et al. Transformer-based multi-scale feature integration network for video saliency prediction. IEEE Trans. Circuits Syst. Video Technol. 33, 7696–7707. 10.1109/TCSVT.2023.3278410 (2023).
- 45.Zhou, X., Cao, W., Gao, H., Ming, Z. & Zhang, J. Sti-net: Spatiotemporal integration network for video saliency detection. Inf. Sci. 628, 134–147. 10.1016/j.ins.2023.01.106 (2023).
- 46.Shi, B., Liu, Y., Lu, S. & Gao, Z.-W. A new adaptive feature fusion and selection network for intelligent transportation systems. Control Eng. Pract. 146, 105885. 10.1016/j.conengprac.2024.105885 (2024).
- 47.Zhang, Y., Zhang, T., Wu, C. & Tao, R. Multi-scale spatiotemporal feature fusion network for video saliency prediction. IEEE Trans. Multimedia 26, 4183–4193. 10.1109/TMM.2023.3321394 (2023).
- 48.Zhang, Y., Wu, C., Guo, W., Zhang, T. & Li, W. Cfanet: Efficient detection of uav image based on cross-layer feature aggregation. IEEE Trans. Geosci. Remote Sens. 61, 1–11. 10.1109/TGRS.2023.3273314 (2023).
- 49.Wedel, A., Pock, T., Zach, C., Bischof, H. & Cremers, D. An improved algorithm for TV-L1 optical flow. In: Statistical and Geometrical Approaches to Visual Motion Analysis: International Dagstuhl Seminar, 23–45 (Springer, 2009).
- 50.Zaheer, M. Z. et al. Generative cooperative learning for unsupervised video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14744–14754 (2022). 10.1109/CVPR52688.2022.01433.
- 51.Deng, H., Zhang, Z., Zou, S. & Li, X. Bi-directional frame interpolation for unsupervised video anomaly detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2634–2643 (2023).
- 52.Tao, C. et al. Feature reconstruction with disruption for unsupervised video anomaly detection. IEEE Trans. Multimed. (2024).
- 53.Zaheer, M. Z., Mahmood, A., Astrid, M. & Lee, S.-I. Clustering aided weakly supervised training to detect anomalous events in surveillance videos. IEEE Trans. Neural Netw. Learn. Syst. 10.1109/TNNLS.2023.3274611 (2023).
- 54.Su, Y., Tan, Y., Xing, M. & An, S. VPE-WSVAD: Visual prompt exemplars for weakly-supervised video anomaly detection. Knowledge-Based Syst. 299, 111978 (2024).
- 55.Sohrab, F., Raitoharju, J., Gabbouj, M. & Iosifidis, A. Subspace support vector data description. In: 2018 24th International Conference on Pattern Recognition (ICPR), 722–727 (2018). 10.1109/ICPR.2018.8545819.
- 56.Lu, C., Shi, J. & Jia, J. Abnormal event detection at 150 fps in matlab. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2720–2727 (2013). 10.1109/ICCV.2013.338.
- 57.Wang, J. & Cherian, A. Gods: Generalized one-class discriminative subspaces for anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 8201–8211 (2019). 10.1109/ICCV.2019.00829.
- 58.Thakare, K. V., Raghuwanshi, Y., Dogra, D. P., Choi, H. & Kim, I.-J. Dyannet: A scene dynamicity guided self-trained video anomaly detection network. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5541–5550 (2023).
- 59.Al-Lahham, A., Tastan, N., Zaheer, M. Z. & Nandakumar, K. A coarse-to-fine pseudo-labeling (c2fpl) framework for unsupervised video anomaly detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 6793–6802 (2024).
- 60.Ullah, W., Ullah, F. U. M., Khan, Z. A. & Baik, S. W. Sequential attention mechanism for weakly supervised video anomaly detection. Expert Syst. Appl. 230, 120599. 10.1016/j.eswa.2023.120599 (2023).
- 61.Thakare, K. V., Sharma, N., Dogra, D. P., Choi, H. & Kim, I.-J. A multi-stream deep neural network with late fuzzy fusion for real-world anomaly detection. Expert Syst. Appl. 201, 117030. 10.1016/j.eswa.2022.117030 (2022).
- 62.Xu, M. et al. Boundary-sensitive pre-training for temporal localization in videos. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 7220–7230 (2021).
- 63.Feng, J.-C., Hong, F.-T. & Zheng, W.-S. Mist: Multiple instance self-training framework for video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14009–14018 (2021).
- 64.Chen, H., Mei, X., Ma, Z., Wu, X. & Wei, Y. Spatial-temporal graph attention network for video anomaly detection. Image Vis. Comput. 131, 104629. 10.1016/j.imavis.2023.104629 (2023).