PeerJ Computer Science. 2025 Aug 8;11:e3066. doi: 10.7717/peerj-cs.3066

A survey of streaming data anomaly detection in network security

Pengju Zhou 1,
Editor: Vicente Alarcon-Aquino
PMCID: PMC12453818  PMID: 40989415

Abstract

Cybersecurity has always been a subject of great concern, and anomaly detection has gained increasing attention due to its ability to detect novel attacks. However, network anomaly detection faces significant challenges when dealing with massive traffic, logs, and other forms of streaming data. This article provides a comprehensive review and a multi-faceted analysis of recent algorithms for anomaly detection in network security. It systematically categorizes and elucidates the various types of datasets, measurement techniques, detection algorithms, and output results of streaming data. Furthermore, the review critically compares network security application scenarios and problem-solving capabilities of streaming data anomaly detection methods. Building on this analysis, the study identifies and delineates promising future research directions. This article endeavors to achieve rapid and efficient detection of streaming data, thereby providing better security for network operations. This research is highly significant in addressing the challenges and difficulties of analyzing anomalies in streaming data. It also serves as a valuable reference for further development in the field of network security. It is anticipated that this comprehensive review will serve as a valuable resource for security researchers in their future investigations within network security.

Keywords: Anomaly detection, Streaming data, Machine learning, Deep learning

Introduction

In recent years, the rapid increase in the number of network devices, and in the streaming data these devices generate, has catalyzed considerable interest in real-time processing. A significant amount of research has been dedicated to efficient anomaly detection solutions. Anomaly detection for streaming data has been applied to various scenarios, such as network intrusion detection (Tidjon, Frappier & Mammar, 2019; Wahab, 2022), fault detection (Bagozi, Bianchini & De Antonellis, 2021; Bonvini et al., 2014), medical diagnosis (Ren, Ye & Li, 2017; Podder et al., 2023), fraud detection (Laleh & Abdollahi Azgomi, 2010; Dal Pozzolo et al., 2015), and network flow analysis (Pramanik et al., 2022; Tang et al., 2020). Streaming data in the field of cybersecurity primarily originates from system logs, network traffic, intrusion detection devices, and other sources of high-speed, real-time data. These data contain features related to network anomaly events, such as IP addresses and protocol types. Conventional network anomaly detection methods perform well on static or offline data. When faced with streaming data, however, their inability to perform online learning and real-time model updates poses significant challenges for network security, which requires real-time anomaly detection and feedback. In response, researchers have introduced various solutions to address these challenges. Some reviews provide a broad survey of streaming data anomaly detection algorithms, aiming to help readers understand related concepts, history, and methods. This article comprehensively reviews recent articles on streaming data anomaly detection in cybersecurity and categorizes the scenarios, data types, research methods, and problems addressed. This research aims to achieve two objectives:

  • Systematically categorize existing research to provide network security researchers with a clear understanding of the current state of the field.

  • Summarize several future research directions based on the classification results to help researchers quickly focus on these directions.

The research structure is as follows: “Research Background” introduces common concepts in anomaly detection. “Articles Selection for Literature Review” describes a rigorous systematic review methodology that includes comprehensive multi-database search strategies, explicit inclusion criteria, and a three-stage filtering process. “Challenges and Requirements in Streaming Data for Anomaly Detection” discusses the challenges and requirements faced by streaming data anomaly detection. “Proposed Classification Method” categorizes anomaly detection datasets, measurement techniques, detection algorithms, and result types, and then reviews and compares related algorithms. “Study of Literature and Discussions” provides a detailed study and discussion of the literature surveyed. “Future Directions and Open Research Challenges” proposes some new research ideas. Finally, “Conclusions” presents the conclusions and provides clear guidance for further work.

Research background

In order to fully understand the research content of streaming data anomaly detection, it is imperative to first elucidate several fundamental concepts. These concepts form the foundational framework for understanding subsequent discussions and furnish the necessary theoretical background for an in-depth exploration of algorithms for streaming data anomaly detection. The following is a detailed explanation of these pivotal concepts:

Anomaly detection: It can be traced back to early statistics and refers to situations that do not match other patterns (Grubbs, 1969). Currently, anomaly detection is generally understood as the identification of events that are uncertain and do not conform to expected patterns.

Streaming data: Streaming data refers to a potentially infinite sequence of data items that arrive continuously at a fast pace. It is defined as $\{(x_t, y_t)\}_{t=1}^{\infty}$, where $(x_t, y_t)$ represents a data item that arrives at time $t$, $x_t \in \mathbb{R}^n$ is an $n$-dimensional feature vector, and $y_t \in Y = \{c_1, c_2, \ldots, c_k\}$ is the class label associated with the data item.

Concept drift: Concept drift was proposed by Schlimmer & Granger (1986). Formally, the process generating the streaming data can be considered as a joint distribution over random variables $Y$ and $X = \{X_1, X_2, \ldots, X_n\}$, where $y \in \mathrm{dom}(Y)$ represents the class label and $x_i \in \mathrm{dom}(X)$ represents the attribute values, with $\mathrm{dom}(\cdot)$ denoting the domain of a random variable. The concept at time $t$ can be represented as $P(Y \mid X_t)$, where $X_t$ refers to the input data. Concept drift occurs when the relationship between the input data and the target label changes because the characteristics of the data differ across sources and across time, that is, $P(Y \mid X_{t_1}) \neq P(Y \mid X_{t_2})$. The types of concept drift include sudden, gradual, incremental, and recurring concept drift (Ramírez-Gallego et al., 2017).

Feature drift: Let $F$ denote the feature space, and let $F_t \subseteq F$ represent the most discriminative subset of features at time $t$. Feature drift occurs when this feature subset changes, that is, $F_{t_i} \neq F_{t_j}$ for $t_i \neq t_j$. Feature drift can occur whether or not the data distribution changes (Barddal et al., 2017).

Concept evolution: Let $Y = \{c_1, c_2, \ldots, c_k\}$ be a set of classes from the training set used to train a classifier, representing the known concepts related to the underlying problem. The classes in $Y$ are referred to as known or existing classes. During the generation of the streaming data, a new class may emerge that does not exist in $Y$. This class is referred to as an emerging class, and this phenomenon is known as concept evolution (Kulesza et al., 2014).

Time window: A time window refers to a subsequence between the $i$th and $j$th arrivals in a sequence, denoted as $w[i,j] = (x_i, x_{i+1}, \ldots, x_j)$, where $i < j$. Time windows can be categorized as Landmark window, Time-dampened window (also known as Fading window), Sliding window, and Tilted-time window (Nguyen, Woon & Ng, 2015). See Figure 1 for a visual representation of these categories.

Figure 1. Time window model. (A) Landmark window. (B) Time-dampened window. (C) Sliding window. (D) Tilted-time window.
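
As a rough illustration of two of these window models, the sketch below maintains a sliding window with a bounded queue and a time-dampened (fading) window as an exponentially decayed mean. The class names, the window size, and the decay factor are illustrative assumptions and are not taken from the surveyed works.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the most recent `size` items (sliding window)."""

    def __init__(self, size: int) -> None:
        self.window = deque(maxlen=size)  # oldest item is evicted automatically
        self.total = 0.0

    def update(self, x: float) -> float:
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # subtract the item about to be evicted
        self.window.append(x)
        self.total += x
        return self.total / len(self.window)


class FadingWindowMean:
    """Exponentially decayed mean (time-dampened window): older items weigh less."""

    def __init__(self, decay: float) -> None:
        self.decay = decay  # assumed decay factor in (0, 1)
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, x: float) -> float:
        # Every previous item is down-weighted by `decay` at each arrival.
        self.weighted_sum = self.decay * self.weighted_sum + x
        self.weight = self.decay * self.weight + 1.0
        return self.weighted_sum / self.weight


if __name__ == "__main__":
    sliding = SlidingWindowMean(size=3)
    fading = FadingWindowMean(decay=0.5)
    for value in [1.0, 2.0, 3.0, 10.0]:
        print(value, sliding.update(value), fading.update(value))
```

The sliding window forgets old items abruptly once they leave the window, whereas the fading window never discards an item outright but lets its influence shrink geometrically; both keep memory bounded regardless of stream length.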

Articles selection for literature review

Overview of the review process

This investigation utilized a systematic methodology to select high-quality articles on streaming data anomaly detection, with pronounced emphasis on network security applications. It established a structured review framework comprising: (1) comprehensive search across multiple databases, (2) the application of stringent inclusion criteria, and (3) the meticulous filtering and critical appraisal of the resultant literature.

Data sources

The systematic review was conducted from September 2023 to October 2024, focusing on publications from 2015–2024. The study conducted an exhaustive search of the relevant literature, including research articles from journals, conferences, books, and magazines. To guarantee comprehensive coverage, multiple databases were utilized, including Google Scholar, IEEE Xplore, Springer, ScienceDirect, Scopus, Web of Science, and ACM Digital Library. Furthermore, a selection of highly reputable conferences such as USENIX, WWW, SIGKDD, VLDB, SIGMOD, NDSS, S&P, ICDM, and CCS were considered. Figure 2 visually presents the distribution of research from different resource types. As depicted, the corpus of literature under review comprises 884 conference articles, 715 journal articles, and 110 books.

Figure 2. Distribution of literature types.


Search keywords and scope

The study search was confined to computer science and engineering domains, utilizing a two-phase keyword strategy:

  • Initial phase: (Stream* data OR Stream* OR Real-time data) AND (Anomaly detection OR Outlier detect* OR Deviation detection OR Novelty detection).

  • Refinement phase: Added domain-specific terms (IoT OR ICS OR SDN OR GRID OR WSN OR Cloud OR Sensor OR Embedded Systems OR Intrusion Detection).

where “∗” indicates wildcard matching to capture variations of the terms.

Selection criteria

To ensure the quality and relevance of the selected literature, the study established specific inclusion criteria as delineated in Table 1. These criteria guided both initial screening and detailed evaluation phases.

Table 1. Criteria for literature selection and explanation.

| Selection criteria | Explanation |
| --- | --- |
| Journal, conference, book | Comprehensively search across various sources to ensure a broad collection of information. |
| Renowned conference or journal | Prioritize sources from well-regarded platforms to maintain the quality and credibility of the data. |
| High citation or download count | Consider articles with high citation or download counts as these are indicative of the work’s recognition and impact within the academic community. |
| Literature from the past 10 years | Focus on the most recent studies to track the latest developments and current trends. |
| Articles proposing detection methods | Specifically look for articles that propose new or improved detection methods to understand the principles and effectiveness of these approaches. |

Filtering process

The study employed a systematic selection process, illustrated in Fig. 3, encompassing three principal stages:

Figure 3. Screening process.


  1. Initial screening: In the initial phase, the keywords of the first search strategy retrieved a total of 2,620 publications. Applying the domain-specific terms of the refinement strategy as a coarse filter then excluded 909 publications, leaving 1,711. From these 1,711 articles, the study removed publications lacking academic rankings and those found irrelevant on abstract review, yielding 252 publications.

  2. Quality assessment: Further evaluation based on methodological rigor, citation impact, and relevance to streaming anomaly detection reduced the selection to 105 high-quality articles.

  3. Final classification: Detailed examination identified 42 articles specifically focused on network security applications of streaming anomaly detection, which formed the core of the analytical review.

Challenges and requirements in streaming data for anomaly detection

Streaming data is characterized by real-time, high-speed, infinite, and variable properties. This study delineates the specific challenges and requirements faced by anomaly detection in streaming data in the field of network security.

  • Single pass scanning: Owing to inherent storage constraints, detection algorithms should be able to process data in a single pass, strictly adhering to the sequential arrival order of the streaming data.

  • Limited memory: Streaming data is continuous and inexhaustible, making it impractical to retain the entire stream in memory. Detection algorithms must detect anomalies within a small memory footprint, ensuring that other applications of the system are not affected.

  • Dynamic adaptation: As streaming data evolves over time, detection algorithms must be engineered for dynamic adaptation or incremental updates to maintain high accuracy in a perpetually changing data landscape.

  • Result approximation: Traditional anomaly detection algorithms based on static data yield reasonably accurate results. However, streaming data detection requires processing data on extremely short timescales, which often necessitates accepting approximate outcomes. Anomaly detection algorithms must therefore strike a delicate balance between computational performance and detection accuracy in this time-constrained environment.
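
To make the single-pass and limited-memory requirements concrete, the following minimal sketch scores each arriving value against a running mean and standard deviation maintained with Welford's online algorithm, so no past items are stored. The class name, the z-score threshold, and the warm-up length are illustrative assumptions rather than settings from any surveyed method.

```python
import math


class RunningZScoreDetector:
    """Single-pass anomaly scoring with O(1) memory (Welford's algorithm)."""

    def __init__(self, threshold: float = 3.0, warmup: int = 10) -> None:
        self.threshold = threshold  # assumed z-score cutoff
        self.warmup = warmup        # assumed number of items seen before scoring
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x: float) -> bool:
        """Score x against the statistics seen so far, then absorb it."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Incremental update: only three numbers are kept, whatever the stream length.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly


if __name__ == "__main__":
    detector = RunningZScoreDetector()
    stream = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 50, 10]
    for position, value in enumerate(stream):
        if detector.update(float(value)):
            print("anomaly at position", position, "value", value)
```

This sketch absorbs every item, including flagged ones, into the running statistics. A practical detector would also need some form of forgetting (for example, a fading window) to satisfy the dynamic adaptation requirement, which this example does not attempt.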

Proposed classification method

This section presents a systematic classification of methodologies for streaming data anomaly detection. Initially, it delves into the dataset typologies employed in the field of network security anomaly detection, which is pivotal in enhancing the caliber of investigative results. Subsequently, the discourse transitions to the examination of measurement techniques tailored for streaming data, which are instrumental for mitigating storage overhead and accelerating processing throughput. The analysis then proceeds to scrutinize the diverse spectrum of detection algorithms, with a particular emphasis on enhancing their efficacy and precision. The advocated classification schema is presented in Fig. 4.

Figure 4. Classification of streaming data anomaly detection algorithms.


Dataset classification

Given that real-world datasets are not universally designed for streaming data, recent research has principally adopted two methodologies. The first approach identifies anomalies by classifying infrequent occurrences as statistical outliers. The second strategy, congruent with the real-time characteristics of streaming data, pinpoints anomalies that conform to the protocol’s real-time criteria. The classification of the datasets is as follows:

  • Statistical dataset: It refers to datasets composed of multiple continuous or discrete features presented in tabular form (Moustafa, Turnbull & Choo, 2018).

  • Sequential dataset: It refers to datasets where there is implicit time correlation or dependency between consecutive data points (Karim, Majumdar & Darabi, 2020).

  • Spatial dataset: It refers to datasets with specific locations and spatial relationships (Wang et al., 2020a).

This study presents a compilation of commonly used datasets, with information including the dataset name, data entries, classified attributes, the number of attacks, and the statistically derived anomaly thresholds, as shown in Table 2.

Table 2. Dataset information.

| Dataset | Records | Attributes | Attack records | Reference |
| --- | --- | --- | --- | --- |
| KDDCUP 99 | 567,497 | 4 | 3,377 | The UCI KDD Archive (1999) |
| CIC 2017 (Thursday-WebAttacks) | 1,070,367 | 4 | 2,180 | Sharafaldin, Lashkari & Ghorbani (2018) |
| CIC 2017 (Tuesday-Working Hours) | 445,909 | 3 | 13,835 | Sharafaldin, Lashkari & Ghorbani (2018) |
| CIC 2017 (Friday-Morning) | 191,033 | 2 | 1,966 | Sharafaldin, Lashkari & Ghorbani (2018) |
| CIC 2017 (Friday-Afternoon-DDos) | 225,745 | 2 | 128,027 | Sharafaldin, Lashkari & Ghorbani (2018) |
| CIC 2017 (Friday-Afternoon-PortScan) | 286,467 | 2 | 158,930 | Sharafaldin, Lashkari & Ghorbani (2018) |
| CIC 2017 (Wednesday-working Hours) | 692,703 | 6 | 252,672 | Sharafaldin, Lashkari & Ghorbani (2018) |
| CTU-13 (48) | 114,077 | 52 | 63 | Garcia et al. (2014) |
| CTU-13 (52) | 107,251 | 58 | 8,164 | Garcia et al. (2014) |
| Edge-IIoTset | 20,952,648 | 15 | 9,728,708 | Ferrag et al. (2022) |
| UNSW-NB15 | 82,332 | 9 | 45,332 | Moustafa & Slay (2015) |
| MQTTset | 8,456,823 | 6 | 115,822 | Vaccari et al. (2020) |
| CIDDS001 (Week1) | 8,451,520 | 4 | 1,440,623 | Ring et al. (2017) |
| CIDDS001 (Week2) | 10,310,733 | 4 | 1,795,404 | Ring et al. (2017) |

Measurement technique

Streaming data measurement in large-scale, high-speed networks is crucial for anomaly detection. Many algorithms summarize streaming data for anomaly detection using approximate computation and compression techniques under the stringent constraints of limited memory and computational resources. This study summarizes the prevalent streaming data measurement techniques as follows:

  • Frequency estimation: Frequency estimation estimates the frequency of elements in streaming data using hash functions and counters. Prominent examples include Space-Saving and the count-min sketch (CMS). CMS uses multiple hash functions and a 2D array of counters to estimate frequencies (Cormode & Muthukrishnan, 2005). Given its exceptional space efficiency, rapid query performance, and robust fault tolerance, this approach is widely adopted across diverse applications. Tong & Prasanna (2017) employed CMS and K-ary sketch for heavy hitter detection and heavy change detection. Bhatia et al. (2022) used CMS for detecting anomalous micro-clusters in streaming data.

  • Quantile estimation: Quantile estimation maintains quantile summaries or quantile estimators whose error stays within specified bounds, such as t-digest (Radke et al., 2018) and the Greenwald-Khanna algorithm (Lall, 2015).

  • Change detection: Change detection identifies change points or anomalies that deviate from the normal pattern by analyzing patterns, trends, or statistical features in streaming data. Among the most commonly employed algorithms are the cumulative sum (CUSUM) control chart and the Page-Hinkley (PH) test. Martínez-Rego et al. (2015) adopted Bernoulli CUSUM for change detection. Duarte, Gama & Bifet (2016) used the PH test to detect changes in the data generation process and responded to them using pruning rules.

  • Cardinality estimation: Cardinality estimation uses hash functions or bit arrays to map elements to specific positions and estimates cardinality from the resulting statistics. In contrast to frequency estimation, it is concerned exclusively with the number of distinct elements in streaming data. Representative structures include LogLog, HyperLogLog, and Bloom filters. Both LogLog and HyperLogLog record, for each register in an array, the maximum leading zero count (LZC) observed among the hashed elements. They differ in how the registers are combined: LogLog averages the register values arithmetically, whereas HyperLogLog takes the harmonic mean of the per-register estimates, which reduces the estimation error. Xiao et al. (2023) proposed three HyperLogLog-based algorithms to estimate streaming distribution and reduce estimation errors.

  • Similarity calculation: Similarity calculation quantifies the relationship between data objects by measuring how similar they are. Locality sensitive hashing (LSH) is a widely used algorithm. LSH selects a family of hash functions engineered to map similar data points to the same bucket with high probability. Notable variants include MinHash, LSH Forest, and random projection (RP). Zeng et al. (2023b) used a double locality sensitive hashing Bloom filter (DLSH) to improve accuracy and efficiency. Yang et al. (2022) proposed DLSHiForest to handle the infinite, correlated, and concept-drifting nature of streaming data, which traditional static anomaly detection algorithms cannot address. Pham et al. (2014) applied RP to obtain compressed data and solve the scalability issue. Lai et al. (2022) replaced the entropy estimation calculation with a simple lookup process using RP.

  • Sampling: Sampling algorithms approximate the analysis of the entire stream by selecting a subset of its elements. These include sticky sampling (SS) and reservoir sampling (RS). SS prioritizes sampling based on data priority, while RS is a random sampling technique that maintains a fixed-size uniform sample of the stream. Wang et al. (2023) adopted weighted RS to model the distribution characteristics of historical reliability streaming data in mobile edge computing (MEC). Yu et al. (2018) applied the RS algorithm to represent the vectors of vertices in dynamic network computation.
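
Two of the techniques listed above can be sketched in a few lines. The first class below is a minimal count-min sketch, in which salted BLAKE2b hashes stand in for the family of hash functions; the second is a basic Page-Hinkley test for an upward shift in the mean of a signal. The class names, sketch dimensions, tolerance, and alarm threshold are illustrative assumptions and do not reproduce the configuration of any cited work.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed memory (depth x width counters)."""

    def __init__(self, width: int = 256, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # One salted hash per row stands in for independent hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a signal."""

    def __init__(self, delta: float = 0.005, threshold: float = 5.0) -> None:
        self.delta = delta          # assumed tolerance for small fluctuations
        self.threshold = threshold  # assumed alarm threshold
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # cumulative deviation from the running mean
        self.minimum = 0.0          # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


if __name__ == "__main__":
    sketch = CountMinSketch()
    for _ in range(100):
        sketch.add("10.0.0.1")
    print("estimated count:", sketch.estimate("10.0.0.1"))

    detector = PageHinkley()
    signal = [0.1] * 200 + [0.6] * 200
    alarms = [i for i, x in enumerate(signal) if detector.update(x)]
    print("first alarm index:", alarms[0] if alarms else None)
```

The count-min sketch never underestimates a count, and widening the table trades memory for fewer collisions. The Page-Hinkley detector raises an alarm once the cumulative deviation rises far enough above its historical minimum, so a larger threshold delays detection but reduces false alarms.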

Detection algorithm

Research on streaming data anomaly detection has evolved along multiple scholarly trajectories. Comprehensive studies have systematically examined diverse detection methodologies, each addressing distinct aspects: Wang, Bah & Hammad (2019) taxonomized approaches from 2000–2019 into distance-based, clustering-based, density-based, ensemble-based, and learning-based categories; Boukerche, Zheng & Alfandi (2020) analyzed algorithmic efficiency parameters and high-dimensional processing challenges while proposing a novel classification framework; Din et al. (2021) specifically addressed concept evolution phenomena in streaming classification; Bhaya & Alasadi (2016) evaluated streaming mining techniques for network traffic anomaly detection; and Souiden, Brahmi & Toumi (2016) furnished comparative assessment frameworks for algorithm selection across various contexts. Further specialized research has emerged along two distinct lines. The first focuses on specific application scenarios: Fahy, Yang & Gongora (2022) addressed the label scarcity problem in dynamic streams with concept drift; Stahmann & Rieger (2021) investigated anomaly detection requirements in Industry 4.0 manufacturing environments with millisecond-frequency sensor data. The second line explores specialized methodological frameworks: Krawczyk et al. (2017) examined ensemble learning approaches for non-stationary stream environments; Faria et al. (2016) analyzed offline/online phase integration and noise-anomaly differentiation in novelty detection; Barbariol et al. (2022) evaluated tree-based methods, particularly iForest variants; Clever et al. (2022) constructed a structured framework for streaming classification workflows. Additionally, comprehensive reviews have addressed cross-cutting challenges in streaming data processing: Gurjar & Chhabria (2015) examined concept evolution in streaming classification with methods for unknown class detection, while Chauhan & Shukla (2015) explored K-Means applications for clustering-based anomaly detection in high-volume, concept-drifting streaming data.

Notwithstanding these significant contributions of existing research, the rapid advancement of big data technologies, machine learning, and deep learning has precipitated the emergence of numerous innovative methodologies in recent years. Accordingly, this study proffers a new taxonomic framework that synthesizes both established classical methods and recently developed approaches into two primary categories: traditional machine learning and deep learning. Within the traditional machine learning paradigm, models are further classified according to their algorithmic principles: statistical models, distance models, clustering models, density models, isolation models, and frequent item mining models. Analogously, deep learning approaches are categorized as reconstruction models, generative models, predictive models, and representation learning models. Employing this structured framework, the present study undertakes a systematic review and comparative evaluation of these distinct methodological classes.

Traditional machine learning

The methods based on traditional machine learning are as follows:

  • Statistical models: These models analyze observed streaming data based on principles of probability theory and statistics. They infer underlying patterns in the data to detect anomalies. The category includes Fourier Transform, Wavelet Transform, Power Spectral Density, Gaussian models, and Entropy models. Hunt & Willett (2018) used a dynamic and low-rank Gaussian mixture model for online anomaly detection in wide-area motion imagery and e-mail databases. Tao & Michailidis (2019) utilized higher-order statistical information to detect attackers in power systems. Chouliaras & Sotiriadis (2019) implemented a suite of algorithms including autoregressive integrated moving average (ARIMA), seasonal ARIMA, and long short-term memory (LSTM) to detect anomalies in sensor data. Yu, Jibin & Jiang (2016) leveraged an ARIMA model to detect anomalies in WSN.

  • Distance models: These models quantify the similarity between two sequences using explicit distances, such as Euclidean distance, Manhattan distance, Chebyshev distance, and Minkowski distance. When the distance between the sequence being tested and the normal sequence exceeds the expected range, the sequence is flagged as an anomaly. Zhu et al. (2020) applied a min-heap to compute upper or lower bounds of the distances between objects and their kth nearest neighbor for anomaly detection in IoT streaming data. Ma, Aminian & Kirby (2019) employed radial basis functions (RBF) to perform novelty detection and prediction on time-series streaming data. Miao et al. (2018) used a distributed online one-class support vector machine (OCSVM) for anomaly detection in WSN.

  • Clustering models: These models map sequence data items into an n-dimensional space and group them into clusters based on similarity in the latent space. If a new data item is far from the cluster centroids or has a low probability of belonging to any cluster, it can be considered an anomaly. Clustering models designed for streaming data can be classified according to the traditional clustering algorithms they build on, including hierarchical-based, partition-based, density-based, grid-based, and model-based clustering. Maimon & Rokach (2005) provided a formal framework for understanding the key distinctions between these clustering approaches. Building on this foundation, Mousavi, Bakar & Vakilian (2015) demonstrated in their comprehensive study of data stream clustering algorithms that streaming data clustering differs significantly from traditional clustering in several aspects. These differences arise from the inherent characteristics of streaming data, such as the need to read data in a specific order, to process it in short time intervals, and to receive the next instance before the entire current stream can be stored. Recent research has extended these foundational concepts to develop more sophisticated clustering approaches for streaming data anomaly detection. Lee & Lee (2022) proposed a kernel-based clustering method to efficiently solve the online clustering problem of multivariate streaming data. To enhance detection accuracy in diverse streaming applications, Degirmenci & Karal (2022) combined local outlier factor (LOF) and density-based spatial clustering of applications with noise (DBSCAN). To contend with the challenge of noise in streaming data, Bigdeli et al. (2018) introduced a novel method called collective probability labeling (CPL), which combines clustering and Gaussian models to gradually update clusters and mitigate the impact of noise on detection results. For large-scale streaming data processing, Bagozi, Bianchini & De Antonellis (2021) developed a parallelized framework for incremental clustering that achieves sustainable processing on distributed architectures. In the IoT domain, Raut et al. (2023) applied adaptive window and adaptive clustering techniques to infer interesting events from continuous sensor streaming data. Bezerra et al. (2020) tackled the fundamental problem of autonomous cluster creation and merging in streaming data using innovative online recursive clustering techniques. Table 3 synthesizes these insights to provide a contrastive analysis between clustering in streaming data and traditional clustering.

  • Density models: These models determine which data points are anomalies by calculating the density around the data points. It is noteworthy that density-based and distance-based models overlap, as density-based models often rely on distance calculations. LOF, the most widely used density-based method, identifies anomalies by comparing the local neighborhood density of data points. Similarly, DBSCAN identifies noise points as anomalies through density-based clustering. Kernel density estimation (KDE) has emerged as a potent non-parametric technique for this purpose. Zhang, Zhao & Li (2019) utilized KDE for density estimation within sliding windows, which significantly enhanced context anomaly detection performance for streaming data. Liu et al. (2020) developed a top-n methodology based on KDE that effectively addresses local anomaly detection challenges in large-scale, high-throughput streaming environments. More recently, to overcome the limitations associated with high-dimensional data processing, Ting et al. (2023) demonstrated that adaptive KDE techniques can dynamically adjust to evolving data distributions, thereby providing more robust probability density estimates for anomaly assessment in non-stationary streaming environments. To address the dual challenges of high-dimensional data and storage efficiency, researchers have developed various extensions and hybrid approaches. Yang, Chen & Fan (2021) introduced the extended LOF, which effectively solved the problems of large storage space requirements and unsatisfactory detection results for high-dimensional data. Aggarwal & Yu (2008) proposed a density-based methodology capable of operating effectively without assumptions regarding the underlying data distribution, thereby eliminating associated uncertainties. In the context of real-time processing, several innovative implementations have been proposed. Shylendra et al. (2020) implemented the KDE kernel via a CMOS Gilbert Gaussian unit, providing a real-time statistical model for the likelihood estimation detection algorithm. Zheng et al. (2017) demonstrated the effectiveness of KDE for real-time outlier detection in distributed streaming data environments. The integration of density-based methods with other techniques has also yielded promising results. Gokcesu et al. (2018) combined density-based approaches with an incremental decision tree (IDT) to construct subspaces of the observation space, effectively detecting anomalies hidden in sequential observation streaming data. Vallim et al. (2014) built upon the Denstream framework proposed by Cao et al. (2006) to develop an unsupervised automatic transformation framework based on density and entropy indicators.

  • Isolation models: These models are predicated on the principle of isolating or partitioning data instances. They separate outliers from normal data points by calculating distance or similarity, or by constructing boundaries and hyperplanes. Isolation forest (iForest), proposed by Liu, Ting & Zhou (2008), stands out as the most classic and foundational method. iForest constructs a set of isolation trees by recursively partitioning the data. Each tree isolates outliers in the shallow layers, while normal points are isolated in deeper layers. Numerous enhancements have been proposed to adapt iForest to streaming data scenarios. Shao et al. (2020) developed AR-iForest, a combination of auto-regressive modeling and isolation forest that aims to enhance the efficacy of anomaly detection in time series data. Heigl et al. (2021) presented PCB-iForest, a solution to the challenges posed by high-volume, high-speed streaming data in computer networks. This implementation integrates extended iForest variants with the capacity to evaluate features based on their contribution to a sample’s anomalousness.

  • Frequent item mining models: Frequent item mining models mine frequently occurring patterns or items from streaming data and treat them as normal patterns. When the pattern of new data appearing in the stream does not match these frequent patterns, it is marked as anomalous. Cai et al. (2020a) proposed a two-phase minimal rare itemset mining methodology to detect anomalies in uncertain streaming data. Cai et al. (2020b) used minimal weighted rare itemset mining to detect anomalies in uncertain streaming data. Hao et al. (2019) proposed a method for mining frequent itemsets from uncertain streaming data through the construction of matrix structures and the application of upper-bound concepts.
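
The isolation principle described in the list above can be illustrated with a simplified sketch: random axis-aligned splits are applied recursively, and points that are separated after few splits are treated as more anomalous. This is not a full iForest implementation; in particular, it omits the leaf-size adjustment and score normalization of the original method, and the function names, sample size, depth cap, and number of trees are illustrative assumptions.

```python
import random


def build_tree(points, depth, max_depth, rng):
    """Recursively split points on a random feature and a random threshold."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split identical values
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, point, depth=0):
    """Number of splits traversed before `point` reaches a leaf."""
    if "split" not in tree:
        return depth
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def average_path_length(data, point, n_trees=50, sample_size=128, seed=0):
    """Average depth at which `point` is isolated across random trees."""
    rng = random.Random(seed)
    max_depth = 8  # assumed depth cap
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(build_tree(sample, 0, max_depth, rng), point)
    return total / n_trees


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical window contents: a dense cluster plus one distant point.
    window = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    window.append((8.0, 8.0))
    print("distant point:", average_path_length(window, (8.0, 8.0)))
    print("central point:", average_path_length(window, (0.0, 0.0)))
```

A shorter average path suggests that a point is easier to isolate and therefore more likely to be anomalous. Streaming variants typically rebuild or update the trees over a sliding window rather than over a fixed dataset as done here.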

Table 3. Comparison between streaming data clustering and traditional clustering.
Streaming cluster | Traditional cluster | References
Online processing | Offline processing | Mousavi, Bakar & Vakilian (2015)
Approximate results | Accurate results | Lee & Lee (2022)
Single-pass processing | Multi-pass processing | Maimon & Rokach (2005)
Retains essential data | All data can be stored | Bezerra et al. (2020)

This study organizes and categorizes the search results, revealing that the mainstream methods for streaming data research primarily include adaptations and variants of seminal algorithms, including KNN, iForest, ARIMA, and LOF. Furthermore, a significant body of work is devoted to hybrid approaches that synergistically combine multiple models. Table 4 below summarizes the respective strengths and limitations of streaming data anomaly detection algorithms based on traditional machine learning.

Table 4. Comparison of advantages and disadvantages of traditional machine learning-based streaming data anomaly detection algorithms.
Model | Advantage | Disadvantage | References
Statistical model | Capable of modeling data, inferring relationships between variables | Requires certain a priori assumptions, needs validation of model reliability, requires the selection of fitting data processing methods | Hunt & Willett (2018), Tao & Michailidis (2019), Yu, Jibin & Jiang (2016)
Distance model | Can mine data in-depth | High requirements for data preprocessing, demanding distance measurement methods, sensitive to noise | Zhu et al. (2020), Ma, Aminian & Kirby (2019), Miao et al. (2018)
Clustering model | Broad applicability, robust interpretability | Not suitable for high-dimensional or large-scale streaming data, sensitive to initial values, high requirements for preprocessing | Lee & Lee (2022), Raut et al. (2023)
Density model | Simple to implement, quickly reveals potential structures and robust to noise | Suffers from the curse of dimensionality in high-dimensional data, computationally intensive for large-scale data | Liu et al. (2020), Zhang, Zhao & Li (2019)
Isolation model | Capable of modeling data distribution, suitable for complex data distributions | Performance may decrease with high-dimensional data | Liu, Ting & Zhou (2008)
Frequent item mining | Effective at identifying outliers and anomalies in low-density areas, no need for labeled data, supports unsupervised learning | Potential for false positives due to noise and outliers in dataset | Cai et al. (2020a), Hao et al. (2019), Cai et al. (2020b)

Deep learning

Deep learning models are also crucial in streaming data analysis.

  • Reconstruction models: They detect anomalies by learning the reconstruction error of the input data. This process involves encoding the input data into a lower-dimensional latent space and then decoding this representation back into the original data space. Anomalies are identified by comparing the differences between the original and the reconstructed data. Yoo, Kim & Kim (2019) utilized a recurrent reconstruction network (RRN) for anomaly detection in temporal data. Zeng et al. (2023a) employed a stacked autoencoder (AE) to better distinguish between nuanced anomalies and subsequently enhanced detection accuracy through a joint optimization with KDE.

  • Generative models: These models learn to synthesize data instances that follow the distribution of the real data and then identify anomalies by comparing real and synthetic data. Prominent members of this class are variants of the Boltzmann machine and the generative adversarial network (GAN). Xing, Demertzis & Yang (2020) implemented a real-time evolving spiking restricted Boltzmann machine for anomaly detection in the Internet of Things (IoT), showing that such approaches can be integrated into real-time streaming analytics. Talapula et al. (2023) combined search and rescue brain-storm optimization (SAR-BSO) with hybrid feature selection (FS) and deep belief network (DBN) classifiers to identify and localize anomalous patterns in streaming logs. The GAN architecture has also been refined extensively for anomaly detection. Li et al. (2019) used long short-term memory-recurrent neural network (LSTM-RNN) models within a GAN to capture multivariate spatiotemporal dependencies in cyber-physical systems. Hallaji, Razavi-Far & Saif (2022) integrated the dynamic temporal attributes of streaming data into GAN-based detection modules, enhancing intrusion detection in IoT ecosystems. Grekov & Sychugov (2022) proposed a distributed processing approach that leverages GAN architectures to synthesize realistic network traffic, improving detection precision while reducing the computational demands of analyzing large volumes of network packets.

  • Prediction models: These models learn the relationship between input data and target variables, in effect fitting a function approximator. Anomalies are identified by comparing predicted and actual values. RNN and LSTM are frequently adopted as the core architectures for these models. Wang et al. (2023) developed an enhanced LSTM-AE to detect runtime reliability anomalies in MEC services based on distributional discrepancy evaluation. Liu et al. (2021) employed both standard and enhanced LSTM for the real-time monitoring and correction of aberrant data in IoT. Cheng et al. (2019) proposed a semi-supervised hierarchical stacked temporal convolutional network (TCN) to facilitate anomaly detection in smart home communication.

  • Representation models: They employ multi-layer neural networks to learn abstract feature representations, thereby capturing complex patterns and anomalous behaviors. Common models in this category utilize convolutional neural network (CNN) or graph neural network (GNN) as their underlying structures. Munir et al. (2018) used CNN to detect common periodic and seasonal outlier anomalies in streaming data. Garg et al. (2019) proposed a hybrid method based on grey wolf optimization (GWO) and CNN for anomaly detection in network traffic of cloud data centers.
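The encode, decode, and compare loop described for reconstruction models in the list above can be sketched without a deep learning framework. The sketch below swaps in a rank-1 linear autoencoder (equivalent to projecting onto the first principal component, fitted by power iteration) in place of the stacked or recurrent autoencoders used in the cited works. The function names and the iteration count are assumptions for illustration only.

```python
import math

def fit_linear_autoencoder(rows, iters=100):
    """Fit a 1-D linear encoder/decoder: returns (mean, unit direction w)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Power iteration: multiply w by the (unnormalized) covariance X^T (X w).
        proj = [sum(c[j] * w[j] for j in range(d)) for c in centered]
        new_w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in new_w))
        if norm == 0.0:
            break
        w = [v / norm for v in new_w]
    return mean, w

def reconstruction_error(x, mean, w):
    """Encode x to a 1-D latent code, decode it, and return the Euclidean error."""
    c = [xi - m for xi, m in zip(x, mean)]
    z = sum(ci * wi for ci, wi in zip(c, w))   # encode
    recon = [z * wi for wi in w]               # decode
    return math.sqrt(sum((ci - ri) ** 2 for ci, ri in zip(c, recon)))
```

Points that follow the structure of the training data reconstruct almost perfectly, while points that violate it leave a large residual, which can then serve as an anomaly score.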

Drawing upon recent investigations, this study systematically organizes and categorizes prevailing research methodologies for streaming data anomaly detection. These methods include the use of AE, VAE, GAN, RNN, CNN, and LSTM as foundational structures, the integration of diverse model combinations, and advanced deep learning paradigms such as reinforcement learning (Zhou, Zhang & Hong, 2019) and transfer learning (Wang et al., 2021). Table 5 summarizes the strengths and limitations of deep learning-based streaming data anomaly detection algorithms.

Table 5. Comparison of advantages and disadvantages of streaming data anomaly detection algorithm based on deep learning model.
Model | Advantage | Disadvantage | References
Reconstruction model | No need for labeled anomaly samples, can capture local and global features of data | Higher false-positive rate for high-dimensional and large-scale data, requires a large amount of training data to learn data distribution and patterns | Yoo, Kim & Kim (2019), Zeng et al. (2023a), Xu et al. (2023)
Generative model | Can model complex, high-dimensional data distributions, can learn data distribution to generate new samples, no need for labeled data | Training and inference processes are complex and time-consuming for complex data distributions and high-dimensional data, prone to mode collapse which can result in a lack of diversity in generated samples | Xing, Demertzis & Yang (2020), Talapula et al. (2023), Li et al. (2019)
Predictive model | Can capture dynamic changes and trends in data, excellent detection performance for time-series data | Issues with gradient vanishing and exploding, need to continuously adapt to new data distributions for non-stationary data | Wang et al. (2023), Liu et al. (2021)
Representation learning model | Suitable for high-dimensional, complex, and large-scale data, better understanding of the intrinsic structure and features of data, can automatically extract useful features | Requires a large amount of training data and computational resources to train deep neural networks, may have poor interpretability in some cases, prone to overfitting | Munir et al. (2018), Garg et al. (2019)

Anomaly and output type

To facilitate systematic analysis and treatment of anomalies, it is essential to establish a comprehensive classification method that encompasses both anomaly types and output types.

  • Classification of anomaly types: Anomalies are typically classified into three types: point anomalies, contextual anomalies, and collective anomalies (Gorunescu, 2011). A point anomaly is an isolated data point that is markedly different from other data points. A contextual anomaly is a data point that deviates from normal behavior only within a specific context, compared with other data points in that context. A collective anomaly is a group of related data points that is anomalous relative to the entire dataset, even if the individual points are not anomalous on their own.

  • Classification of output types: Output result types are generally divided into label and score (Chandola, Banerjee & Kumar, 2009). Anomaly labels allow for direct determination of whether each point is an anomaly based on the model’s output. Anomaly scores provide further insight into which points exhibit a higher degree of anomaly.
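The relationship between the two output types can be made concrete with a small sketch: a score output can always be reduced to a label output by thresholding, while the scores additionally allow points to be ranked by degree of anomaly. The function names and the convention that higher scores are more anomalous are assumptions for illustration.

```python
def scores_to_labels(scores, threshold):
    """Turn anomaly scores into binary labels (1 = anomaly), assuming higher = more anomalous."""
    return [1 if s >= threshold else 0 for s in scores]

def rank_by_score(scores):
    """Indices of the points ordered from most to least anomalous."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The choice of threshold is left to the analyst here; in practice it controls the trade-off between detection rate and false alarm rate.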

Study of literature and discussions

The preceding sections described the datasets used for detection, the measurement algorithms, the detection algorithms, and the types of anomalies and outputs. Building on this, this study conducts a theoretical appraisal of the literature, guided by a set of evaluation criteria, and then discusses the detection and measurement algorithms for streaming data in detail.

Evaluation criteria

This research proposes a comprehensive set of evaluation criteria for assessing existing research. As shown in Table 6, these criteria encompass a broad spectrum of research outcomes. Specifically, the evaluation criteria mainly include efficiency metrics, evaluation metrics, task metrics, and anomaly interpretability.

Table 6. Literature evaluation of streaming data detection in the field of network security.

Ref. AD CD EC DP Explain Effectiveness Output Type
Bhatia et al. (2022) Statistical Model * AUC, ROC, Acc, Recall Batch Scalable Score Point
Tong & Prasanna (2017) FIM * Precision, Recall Stream Adaptive Score Point
Hao et al. (2019) Cluster, LSTM, AE Deland, FPR, ROC-AUC, Recall Stream Robust Cluster label Contextual
Hoeltgebaum, Adams & Fernandes (2021) Statistical Model FD, FP, FN, MSE, MAE Stream Adaptive Label Point
Nadler, Aminov & Shabtai (2019) iForest * DR, FPR Batch Robust Score Collective
Wambura, Huang & Li (2022) RNN * MAE, ROC-AUC Stream Scalable, robust Score *
Xiaolan et al. (2022) Cluster DR, FAR, Acc Stream Robust No Collective
Yin, Li & Yin (2020) Statistical Model * EDR, EFP, END, ENF, EFR * Robust Score Point
Zeng et al. (2023b) Bloom Filter * DR, FAR Stream Robust Label Point
Wahab (2022) DNN Precision, Recall, F1, TP, FP, FN, TN, Acc * Robust Label Collective
Cai et al. (2022) K-Means, Cluster Acc Stream Robust Label All
Cheng et al. (2020) TCN * Acc, Precision, Recall, F1 Batch * Robust Label Collective
Jain, Kaur & Saxena (2022) K-Means, Cluster, SVM Acc, FAR, Precision, Recall, F1, Kappa Statistic Stream * Adaptive, robust Label Collective
Mirsky et al. (2017) Cluster ROC, AUC, TPR, FPR Stream * Robust Score All
Scaranti et al. (2022) DBSCAN, Entropy Acc, Precision, Recall, F-measure, FAR Stream Adaptive Label All
Shao et al. (2023) Bloom filter Acc Stream * Robust Label All
Xing, Demertzis & Yang (2020) e-SNN, REBOM K-Stats, K-Temp-Stats Stream * Robust, scalable Label Point
Xu et al. (2023) AE, SVM AUC Stream Robust Score Point
Yang et al. (2021) XGboost AUC Stream Score Point
Zeng et al. (2023a) KDE, AE Recall, Precision, F-score, ROC, TPR, FPR, AUC Stream Scalable, adaptive, robust Score Point
Zhou et al. (2020) Variational LSTM * Precision, Recall, F1, FAR, AUC Batch Scalable Label Point
Saheed, Abdulganiyu & Tchakoucht (2023) GWO, ELM, PCA * Precision, Recall, DR, Acc Batch * Scalable, robust Label Point
Yoon et al. (2022) AE AUC Batch Scalable, adaptive, robust Score Point
Yu et al. (2018) Cluster * AUC, Acc Stream Scalable, adaptive, robust Score Point

Note:

AD, anomaly detection; CD, concept drift; EC, evaluation criteria; DP, data processing mode; FIM, frequent itemset mining; FD, false detection; MAS, mean average score; EDR, event detection rate; EFP, event false positive rate; END, error node detection rate; ENF, error node false positive rate; EFR, error node false recognition rate; *, unknown; ✗, not supported; ✓, supported.

  • Efficiency metrics: Efficiency metrics quantify the efficiency of an algorithm, particularly with respect to its time and space complexity.

  • Evaluation metrics: Evaluation metrics are used to evaluate the efficacy of an algorithm. They primarily encompass the following: false alarm rate (FAR), recall, detection rate (DR), true positive rate (TPR), true negative rate (TNR), receiver operating characteristic curve (ROC), and area under curve (AUC). In addition, further indices such as Youden’s index (Harush, Meidan & Shabtai, 2021) and the Kappa coefficient (Xing, Demertzis & Yang, 2020; Jain, Kaur & Saxena, 2022) are also used as reference standards.

  • Task metrics: Task metrics assess an algorithm’s capability to manage concept drift and feature drift, as well as its aptitude for handling high-dimensional or large-scale streaming datasets.

  • Anomaly explanation: Anomaly explanation indicates whether an algorithm can provide explanations for the anomalies it detects.
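Several of the evaluation metrics listed above can be computed directly from the four cells of a binary confusion matrix. The sketch below assumes the common convention in which recall, DR, and TPR denote the same quantity and FAR equals the false positive rate; individual papers in Table 6 may define these terms slightly differently. The function name and the dictionary keys are illustrative.

```python
def detection_metrics(tp, fp, tn, fn):
    """Basic detection metrics from confusion-matrix counts (anomaly = positive class)."""
    n = tp + fp + tn + fn
    tpr = tp / (tp + fn)            # recall / detection rate / true positive rate
    tnr = tn / (tn + fp)            # true negative rate
    far = fp / (fp + tn)            # false alarm rate = 1 - TNR
    youden = tpr + tnr - 1          # Youden's index
    p_observed = (tp + tn) / n      # observed agreement (accuracy)
    # Agreement expected by chance, from the predicted and actual marginals.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {"tpr": tpr, "tnr": tnr, "far": far, "youden": youden, "kappa": kappa}
```

The sketch assumes that both classes occur at least once and that chance agreement is below 1; otherwise the divisions are undefined.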

Detection algorithm discussion

The vast majority of traditional machine learning models are lightweight, meaning they have compact model size and low memory requirements. Therefore, these algorithms are widely used in resource-constrained environments.

Statistical models, valued for their parsimony and straightforward implementation, have demonstrated considerable utility across a multitude of applications. Hunt & Willett (2018) adapted a Gaussian mixture model for online anomaly detection in wide-area motion imagery and email databases. Tao & Michailidis (2019) used higher-order statistical techniques to detect false data injection (FDI) attacks in power systems. Ma, Aminian & Kirby (2019) applied ARIMA to detect anomalous traffic in WSN. Nevertheless, a principal constraint of such statistical methodologies is their inherent reliance on specific distributional assumptions. This prerequisite can substantially limit their effectiveness when confronted with the complexity of real-world data, where such assumptions may not be valid.
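The reliance on a distributional assumption can be seen in even the simplest online statistical detector. The sketch below is not taken from any of the cited works: it assumes the stream is roughly Gaussian, maintains the mean and variance in a single pass with Welford's update, and scores each new value by its absolute z-score. The class name is hypothetical.

```python
import math

class RunningGaussian:
    """Single-pass mean/variance tracker that scores values by absolute z-score."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def score(self, x):
        """Absolute z-score of x under the current estimate (0.0 until enough data)."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Welford's online update of the mean and the squared-deviation sum."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
```

If the stream is heavy-tailed or multimodal, the z-score no longer has its usual interpretation, which is the limitation the paragraph above describes.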

Distance models facilitate localization and identification of anomalies by leveraging distance or similarity measures. Zhu et al. (2020) employed a min-heap to compute upper or lower bounds of the distances between objects and their k-th nearest neighbors for anomaly detection in IoT. Miao et al. (2018) implemented online OCSVM in distributed WSN. However, distance-based models often face the curse of dimensionality and high computational costs when dealing with high-dimensional or large-scale data. It is essential to adopt suitable dimensionality reduction methods and to optimize distance metrics to ease these challenges.
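As a minimal illustration of a distance-based score over a stream, the sketch below keeps a sliding window of recent points and scores a new point by the distance to its k-th nearest neighbor in that window. It is a brute-force baseline, not the bound-based pruning of Zhu et al. (2020); the class name, window size, and k are assumptions.

```python
import math
from collections import deque

class KnnWindowDetector:
    """Scores a point by its distance to the k-th nearest neighbor in a sliding window."""

    def __init__(self, window=100, k=3):
        self.points = deque(maxlen=window)  # oldest points are dropped automatically
        self.k = k

    def score(self, x):
        if len(self.points) < self.k:
            return 0.0
        # Brute force: one distance computation per stored point.
        dists = sorted(math.dist(x, p) for p in self.points)
        return dists[self.k - 1]

    def update(self, x):
        self.points.append(tuple(x))
```

Every query touches every point in the window, which makes the computational cost discussed above visible: the work grows with both the window size and the dimensionality.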

Cluster models adapt to changes in the intrinsic structure of the data, which makes them well suited to distributional shifts in streaming data. Jain, Kaur & Saxena (2022) and ZareMoodi, Kamali Siahroudi & Beigy (2019) used cluster-based models to address concept drift. Wang et al. (2022) analyzed the influence of collective anomalies based on clustering. Zou et al. (2023) combined grid clustering with a Gaussian model to improve the ability to distinguish noise from anomalies. Harush, Meidan & Shabtai (2021) integrated cluster-based methods with deep learning to classify contextual information in real time. Bah et al. (2019) employed micro-clusters to refine the search space for outlier detection. Wang et al. (2020b) used the centers of the micro-clusters within each class as inputs for detecting unknown classes, projecting these centers onto fixed positions on orthogonal axes in the feature space to form clear classification boundaries.

Density models effectively address local anomaly detection in high-dimensional or large-scale streaming data with reduced memory requirements. Liu et al. (2020) introduced a KDE-based outlier detection method that substantially accelerates processing speed through an upper-bound pruning strategy. Chen, Wang & Yang (2021) employed entropy-weighted index calculation and reachable distance factor discrimination methods, achieving a 15% improvement in accuracy while requiring only 1% of the runtime compared to the traditional LOF. Density-based models can adapt to the evolution of streaming data, but they have high computational complexity and are challenging to apply to high-dimensional sparse or large-scale streaming data.

Isolation models exhibit robust real-time capability (Shao et al., 2020). However, they are characterized by high computational complexity and are sensitive to model parameters. Consequently, the development of more efficient isolation frameworks is imperative to improve detection speed.

Frequent item mining algorithms can uncover common patterns in streaming data, thereby enhancing the recognition and pinpointing of anomalous samples. Cai et al. (2020a, 2020b) and Hao et al. (2019) stored uncertain streaming data information in matrices to detect outliers. Compared to conventional frequent item mining algorithms for static data, these approaches are more suitable for processing large-scale uncertain streaming data.

Deep learning-based algorithms are becoming increasingly popular owing to their scalability, their adaptability, and their ability to extract relevant features automatically and to handle complex relationships in large-scale streaming data.

Reconstruction models effectively learn high-level feature representations of data without the need for manual feature design. These models provide reconstruction errors or losses for anomalous data, making the models and results easier to understand and interpret. Yoo, Kim & Kim (2019) and Zeng et al. (2023a) used AE to detect anomalies by analyzing the reconstruction errors between the reconstructed and original sequences. Nevertheless, reconstruction models have high complexity and require a sufficient number of normal data samples as training sets to establish the distribution model for normal data.

Generative models not only capture complex streaming data distributions but also generate new samples based on the learned distributions. The GAN architecture is a representative example and has proven effective in anomaly detection. Li et al. (2019) extended the conventional GAN framework by using LSTM-RNN as its base architecture, thereby exploiting the spatiotemporal correlations and multivariate dependencies in sequential data streams. Hallaji, Razavi-Far & Saif (2022) incorporated the dynamic temporal characteristics of streaming data directly into the GAN detector, improving intrusion detection precision in IoT ecosystems. Alongside GAN-based methods, restricted Boltzmann machines (RBMs) are an equally viable tool in this category. Xing, Demertzis & Yang (2020) improved anomaly detection through a real-time evolving spiking RBM architecture that is well suited to high-velocity data streams. Talapula et al. (2023) combined RBM-based deep belief networks with metaheuristic algorithms for streaming log data analysis, improving the model’s ability to discriminate anomalies.

Prediction models, with their powerful expressive capabilities and adaptability, can capture temporal dependencies and patterns. Wang et al. (2023) combined LSTM prediction models with AE to compress time series data into low-dimensional feature models. Compared to employing a single prediction-based model, combining it with other types of deep learning models improves the efficiency and effectiveness of real-time streaming data detection. Liu et al. (2021) used LSTM+ for real-time monitoring and correction of streaming data in IoT. Similarly, Cheng et al. (2019) stacked distance-based models, such as KNN and SVM, with probabilistic models, including decision trees and Bayesian classifiers, and combined them with TCN for sequence problems.

Representation models can capture the underlying features and intrinsic structures of streaming data. Munir et al. (2018) used a CNN-based representation learning model, termed DeepAnT, to address periodic and seasonal anomalies that cannot be handled by the distance-based and density-based anomaly detection techniques of traditional machine learning. Garg et al. (2019) improved the standard CNN by employing a uniform distribution method for anomaly detection in heterogeneous streaming data.

Measurement algorithm discussion

Frequency estimation can estimate the occurrence frequency of various elements in streaming data. Consistent hashing and counters can efficiently estimate element frequencies in a single pass, characterized by minimal time and space complexity. These approaches yield error bounds for frequency approximation, meeting the real-time detection requirements. However, there is a certain estimation error compared to exact calculations due to the use of approximate counters.
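One widely used structure that combines hashing with counters in this way is the Count-Min sketch. It is used here as an illustrative choice and is not named in the paragraph above; the class name, table dimensions, and the use of BLAKE2 as the hash are assumptions.

```python
import hashlib

class CountMinSketch:
    """Approximate per-item frequency counts in fixed memory."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting the input with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the tightest estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

The estimate never falls below the true count and never exceeds the total number of items added, which is the kind of one-sided error bound the paragraph refers to. Wider tables reduce the overestimation at the cost of memory.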

Quantile estimation elucidates the global distribution of streaming data by calculating the approximate quantiles in real time. However, in cases where the data distribution in streaming is extremely uneven, the estimation of tail quantiles may be less accurate (Liu et al., 2018).

Change detection employs chi-square tests or the KL divergence to detect changes in the statistical distribution and aggregate metrics of streaming data. However, this approach is sensitive to parameter settings (Kuncheva, 2011).
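A minimal sketch of KL-based change detection compares the empirical distribution of a categorical feature (for example, protocol type) in a current window against a reference window. The function name, the additive smoothing constant, and the direction of the divergence (current relative to reference) are assumptions made for illustration.

```python
import math
from collections import Counter

def kl_divergence(reference, current, smoothing=1e-6):
    """KL(current || reference) between the empirical distributions of two windows."""
    keys = set(reference) | set(current)
    ref_counts, cur_counts = Counter(reference), Counter(current)
    # Additive smoothing keeps the ratio finite for values seen in only one window.
    ref_total = len(reference) + smoothing * len(keys)
    cur_total = len(current) + smoothing * len(keys)
    kl = 0.0
    for key in keys:
        p = (cur_counts[key] + smoothing) / cur_total
        q = (ref_counts[key] + smoothing) / ref_total
        kl += p * math.log(p / q)
    return kl
```

A change would be flagged when the divergence exceeds a threshold. The window length, the smoothing constant, and that threshold are exactly the parameters to which the approach is sensitive.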

Cardinality estimation leverages probabilistic data structures to efficiently estimate the cardinality of streaming data in a single pass. However, compared to frequency estimation, real-time performance may be inferior (Jie et al., 2022).
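As one example of a probabilistic structure for single-pass cardinality estimation, the sketch below uses linear counting: each item is hashed to one bit of a fixed bitmap, and the number of distinct items is inferred from the fraction of bits that remain zero. The paragraph above does not name a specific structure, so this choice, the class name, and the bitmap size are assumptions.

```python
import hashlib
import math

class LinearCounter:
    """Estimates the number of distinct items from the fill level of a bitmap."""

    def __init__(self, m=1024):
        self.m = m
        self.bits = bytearray(m)   # one byte per bit position, for simplicity

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        self.bits[int.from_bytes(digest, "big") % self.m] = 1

    def estimate(self):
        zeros = self.m - sum(self.bits)
        if zeros == 0:
            return float("inf")    # bitmap saturated; a larger m would be needed
        return -self.m * math.log(zeros / self.m)
```

Repeated items set the same bit again, so duplicates do not change the estimate. Accuracy degrades as the bitmap fills, which is why the bitmap must be sized for the expected cardinality.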

Similarity calculation utilizes LSH or other algorithms to rapidly retrieve neighbors and compute approximate similarities in sublinear time in streaming data. However, similarity calculations for high-dimensional and large-scale streaming data may be time-consuming, requiring appropriate similarity measurement methodologies and thresholds (Wu et al., 2024).
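MinHash is one LSH family, suited to Jaccard similarity between sets, and gives a feel for how approximate similarity is computed from compact signatures. The sketch below shows only the signature and the similarity estimate; a full LSH index would additionally band the signatures into hash buckets to retrieve candidate neighbors. The function names and the number of hash functions are assumptions.

```python
import hashlib

def _hash(seed, item):
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, num_hashes=128):
    """Signature of a non-empty set: the minimum hash value under each seeded hash."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing two signatures costs time proportional to the signature length rather than to the set sizes. The estimate becomes tighter as the number of hash functions increases.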

Sampling techniques are an important dimension in this study’s taxonomy of streaming data anomaly detection methods, as they extract representative subsets from streaming data and thereby significantly reduce storage and processing requirements. While effective sampling demands carefully designed random or stratified strategies, sketch algorithms have emerged as particularly promising within this category because of their sublinear complexity and their efficiency advantages over traditional item-by-item processing. Recent advances in sketch algorithms demonstrate their increasing relevance to streaming anomaly detection. Liu et al. (2016) developed a universal monitoring framework that shows how sketching can provide system-wide visibility while preserving computational efficiency. Building on this foundation, Yang et al. (2018) addressed varying traffic conditions through adaptive sketching methods, a critical capability for detecting anomalies in dynamically evolving streams. To address the need for scalable detection in high-throughput network scenarios, Huang et al. (2017) proposed a robust measurement framework. The evolution of sketch-based algorithms has focused primarily on two aspects central to effective anomaly detection: accuracy and performance. Liu et al. (2015) tackled the fundamental challenge of hash collisions by reconstructing and estimating infected host cardinality using overlapping hash bit strings based on a vectorized Bloom filter. This approach significantly enhances detection accuracy. More recently, Zeng et al. (2023b) advanced this concept by establishing a multi-layer dLSHBF model using Bloom filters, which avoids element conflicts by reducing the hash encoding length of the data. Similarly, Xiao et al. (2023) employed a multi-level design combined with TailCut’s register compression technique (Xiao et al., 2020) to alleviate hash collisions between streams, demonstrating how hash function selection critically impacts detection algorithm performance. Within this study’s framework of streaming data anomaly detection methodologies, these sketch-based approaches represent a particularly valuable direction for network security applications, offering a favorable balance between computational efficiency and detection accuracy.
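Since several of the works above build on Bloom filters, a plain Bloom filter is sketched here to show where hash collisions enter. This is the textbook structure, not the vectorized variant of Liu et al. (2015) or the dLSHBF model of Zeng et al. (2023b); the class name, bit-array size, and number of hash functions are assumptions.

```python
import hashlib

class BloomFilter:
    """Set membership in fixed memory: no false negatives, some false positives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one byte per bit position, for simplicity

    def _positions(self, item):
        # k hash functions, derived by salting the input with the function index.
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added is always reported as present. An item that was never added can be reported as present when all of its positions collide with bits set by other items, and this collision effect is what the multi-layer and compression designs discussed above aim to reduce.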

Future directions and open research challenges

Although many anomaly detection algorithms for streaming data have been proposed, there are still some challenges and limitations that make it impractical to use these algorithms in the real world. Therefore, in this section, the study will describe the limitations of existing research literature and propose research questions and suggestions in the context of network security.

Limitations of current research

This study surveys the latest literature on anomaly detection in streaming data and discusses some issues in the research direction of streaming data in the field of network security. During the research process, this study verified that the existing literature in this field has not adequately addressed the relevant problems, thus requiring more attention. Based on the previous statements and research, the following general limitations of existing research literature can be identified:

  1. The field of cybersecurity has not adequately addressed the issue of high-dimensional streaming data.

  2. The interpretability of detected anomalies in streaming data is poor.

  3. There is still significant room for improvement in the efficiency of these models or methods.

  4. There is a lack of methods for handling anomalies in multi-type data.

  5. There is a scarcity of datasets appropriate for streaming data in the field of network security.

Future research directions

Anomaly detection in high-dimensional streaming data: Anomaly detection in high-dimensional streaming data typically requires a significant amount of feature engineering. Deep learning can automatically learn relevant features. With its multi-layer structure, it can manage high-dimensional streaming data effectively by learning more abstract representations through the hierarchical extraction and combination of features. However, the application of deep learning algorithms in the field of cybersecurity, especially to high-dimensional streaming data, has not been extensively studied. Therefore, applying deep learning to anomaly detection in streaming data is a direction worth exploring. It is worth noting that high-dimensional streaming data requires more hidden layers for learning input features, resulting in a linear increase in model computational complexity as the number of hidden layers grows.

Detection of anomalies in multi-type data: Different applications generate diverse streaming data, such as text streaming, image streaming, video streaming, etc. Detecting anomalies in these heterogeneous streaming data can be cumbersome. Therefore, there is a need to research new algorithms or techniques to handle these complex data formats to better understand streaming data in the real world.

Interpretability of anomalies: Recent studies have underscored the significance of anomaly interpretation, particularly in the context of streaming data. The aim is to find plausible explanations for detected abnormal patterns. This helps in understanding and evaluating detection results and enhances the reliability of anomaly evaluation. So far, the existing literature has focused on interpreting anomalies in low-dimensional streaming data, so further investigation is required for high-dimensional streaming data.

Improvement of detection algorithm efficiency: The characteristics of streaming data require anomaly detection algorithms to produce results at a lower computational cost, making the efficiency of algorithms crucial. Due to limitations in time and storage space, it is worth considering compressing the streaming data first and then combining traditional machine learning algorithms or deep learning models for anomaly detection. This approach can improve efficiency while ensuring detection accuracy.

Application of large language models: Large language models (LLMs) offer a transformative approach to anomaly detection in network security streams by enabling the analysis of textual and semi-structured data at a deep semantic level. By processing diverse data sources, such as network logs, system alerts, or command-line sequences, LLMs can harness their sophisticated contextual comprehension to detect subtle and intricate anomalies that often elude conventional detection methodologies. For example, they are capable of identifying emerging phishing campaigns or multi-stage insider threats by analyzing the narrative flow and contextual coherence of communication streams. Despite this immense potential, their practical application remains an underexplored research area, primarily due to significant challenges. The high computational requirements and inference latency of LLMs pose a barrier to real-time analysis, while the need for curated, domain-specific datasets for effective fine-tuning presents another obstacle. Therefore, future research should be strategically directed towards enhancing the viability of LLMs for this task. Investigating lightweight architectures, such as distilled or pruned models, and exploring sophisticated transfer learning methodologies represent promising pathways to harness the power of LLMs for real-time, adaptive anomaly detection in cybersecurity.

Conclusions

This survey provides a comprehensive overview of streaming data anomaly detection in network security, systematically categorizing existing research based on datasets, measurement techniques, detection algorithms, anomaly types, and output types. Through this categorization, evaluation criteria have been derived to assess the characteristics, advantages, and limitations of various research approaches. It has been established that streaming data anomaly detection is critical in addressing the challenges posed by the increasing volume and complexity of data in cybersecurity applications. Several key challenges have been identified, including the need for efficient algorithms to handle high-dimensional and multi-type data, improved interpretability of anomalies, and the exploration of emerging techniques such as LLMs. It is anticipated that this survey will enhance understanding of this vital research area and provide valuable guidance for future investigations. Future research is expected to focus on two primary directions. First, a more in-depth analysis of mainstream literature is planned to offer scholars and practitioners a thorough understanding of current research developments. Second, it is proposed to develop a novel anomaly detection method to address the limitations observed in existing approaches, leveraging the unique characteristics of streaming data to improve efficiency. Specifically, an efficient anomaly detection algorithm will be designed and its performance and accuracy validated through rigorous experiments.

Funding Statement

The author received no funding for this work.

Additional Information and Declarations

Competing Interests

The author declares that they have no competing interests.

Author Contributions

Pengju Zhou conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the article, and approved the final draft.

Data Availability

The following information was supplied regarding data availability:

This is a literature review.




Articles from PeerJ Computer Science are provided here courtesy of PeerJ, Inc
