GraphTS: Graph-represented time series for subsequence anomaly detection

Roozbeh Zarei; Guangyan Huang; Junfeng Wu

doi:10.1371/journal.pone.0290092

2023 Aug 16;18(8):e0290092. doi: 10.1371/journal.pone.0290092

Editor: Vijayalakshmi Kakulapati²

PMCID: PMC10431630 PMID: 37585396

Abstract

Automatic detection of subsequence anomalies (i.e., an abnormal waveform denoted by a sequence of data points) in time series is critical in a wide variety of domains. However, most existing methods for subsequence anomaly detection require knowing the length and the total number of anomalies in the time series. Some methods fail to capture recurrent subsequence anomalies because they use only local or neighborhood information for anomaly detection. To address these limitations, we propose a novel graph-represented time series (GraphTS) method for discovering subsequence anomalies. In GraphTS, we provide a new time series graph representation model, which represents both recurrent and rare patterns in a time series. In particular, we develop a new 2D time series visualization (2Dviz) method, which compacts all 1D time series patterns into a 2D spatial-temporal space. The 2Dviz method transfers time series patterns into a higher-resolution plot in which subsequence anomalies are easier to recognize. Then, a graph is constructed based on the 2D spatial-temporal space of the time series to capture recurrent and rare subsequence patterns effectively. The resulting graph can also be used to discover single and recurrent subsequence anomalies of arbitrary length. Experimental results demonstrate that the proposed method outperforms the state-of-the-art methods in terms of accuracy and efficiency.

1 Introduction

Time series anomaly detection is an important problem with applications in various domains such as manufacturing, medical, and engineering [1–9]. Generally, an anomaly changing with time [10] can be a point anomaly (i.e., a single value beyond a regular range) or a sequence anomaly (i.e., an abnormal waveform denoted by a sequence of data points) [11, 12]. Detecting sequence anomalies is crucial, especially in real-world applications, where the values of individual points may not exhibit any anomaly but the trend (or the shape of a subsequence) may be abnormal. More significantly, an abnormal trend often means a possible problem at an early stage that may lead to a severe problem if not intervened. For instance, detecting some abnormal heartbeat waveforms (i.e., arrhythmia) in electrocardiograms may indicate an early stage of a severe heart disease. Therefore, this paper focuses on detecting subsequence anomalies.

Unfortunately, automatically detecting subsequence anomalies faces three challenges. First, most existing subsequence anomaly detection methods only work on a specific domain-determined waveform, such as electrocardiogram (ECG) [13] or electroencephalogram (EEG) [14, 15] signals. They rely on domain knowledge about the waveform and length of the anomaly to discover anomalous subsequences. As the characteristics of the anomalies (i.e., waveform patterns and lengths) often differ between domains, it is hard to apply these domain-specific techniques to another domain. Second, several domain-agnostic methods [4, 16] specifically developed for detecting subsequence anomalies in diverse domains are unable to identify repeated anomalies comprising highly similar instances of anomalous subsequences [17–19]. These methods use local or neighborhood information to define subsequence anomalies. For example, they often use the largest distance of a subsequence to its nearest neighbors to identify anomalies. The assumption behind these methods is that the abnormal subsequence is distant (i.e., entirely separated) from the normal subsequences; that is, if a subsequence pattern has at least two instances, it is not abnormal. Therefore, these methods can detect a single abnormal subsequence or multiple dissimilar abnormal subsequences (referred to as discords) in time series. However, they fail to detect recurrent abnormal subsequences with similar shapes (called the “twin freak” problem [19]). To solve this issue, the m^th nearest neighbor can be used instead of the first nearest neighbor to calculate the discord score; in [20], an abnormal subsequence pattern with m instances can be identified in this way. However, this method assumes that the number of anomalies is known, which is rarely the case in practice.
Third, existing methods for discord discovery can accurately find anomalies when the user selects the proper subsequence length [21] as an input parameter, but they suffer noticeably when the length is mismatched. To show how these methods are impacted by subsequence length, we consider two subsequence lengths, 50 and 70, and compute the anomaly score using STOMP [21] for every subsequence of the time series shown in Fig 1(a). The time series is the Arterial Blood Pressure (ABP) of a healthy man on a tilt table with one synthetic anomaly (highlighted in red in Fig 1(a)). For a subsequence length of 70, as shown in Fig 1(b), the anomaly is correctly identified, as indicated by the highest discord score. However, for a subsequence length of 50, as shown in Fig 1(c), a normal part of the signal is identified as an anomaly, which is a false positive.

Fig 1 — (a) the ABP time series of a healthy man on a tilt table test with one anomaly highlighted in red. The discord score for each subsequence of length 70 (b) and 50 (c). The corresponding subsequence with the highest score is considered an anomaly.

To address the aforementioned three challenges, we propose a novel graph-represented time series (GraphTS) method for subsequence anomaly detection. The GraphTS represents time series as a graph that is constructed using normal and abnormal time series subsequences. In GraphTS, we first develop a new 2D visualization (2Dviz) method, which transfers a time series into a 2-dimensional spatial-temporal space (2DSTS) by projecting subsequences with similar patterns into very close spatial locations. Then, the spatial and temporal information in 2DSTS is used to construct a graph, where nodes represent subsequence patterns and edges represent the number of successive occurrences of these patterns in the original time series. The constructed graph represents all subsequences in time series, including regular, frequent, and anomalous patterns. The recurrent consecutive normal patterns and rare abnormal patterns in time series are represented by paths in the graph that are composed of high- and low-weighted edges, respectively. This enables distinguishing between normal subsequences from abnormal ones using the represented graph.

One advantage of GraphTS is that the graph representation of time series captures both subsequence patterns (i.e., recurrent patterns and abnormal patterns) and thus can detect a complete set of subsequence anomalies. Another advantage of our GraphTS method is its ability to simplify the task of detecting anomalous subsequences by converting raw time series data into a graph representation; so, anomalous subsequences are those with low weights on the path’s edges between two nodes. The third advantage is that GraphTS builds the graph without knowing the length of the anomalies and it can identify anomalous sequences with arbitrary lengths using the same representation graph. Experimental results show that the proposed method outperforms the state-of-the-art STOMP [21] and Series2Graph [22] methods in terms of both accuracy and execution time.

Our proposed method differs from the two most related works (STOMP and Series2Graph) as follows. (1) STOMP detects anomalies by defining local discord patterns; therefore, it is unable to detect recurrent anomalies. Our proposed GraphTS globally projects similar subsequences of a time series into close nodes in a 2D graph and thus can correctly detect both single and frequent anomalies based on their representation paths in the graph. (2) STOMP needs to be executed for each anomaly length ℓ, and its performance degrades if the value of ℓ is not correctly selected. Our method compacts all global information of a time series into a graph that allows identifying anomalies with different lengths, and thus it is robust to variation in anomaly length. (3) While Series2Graph also utilizes a graph representation to identify anomalies, it adopts an entirely different approach to create the graph, which is elaborated in Section 2. Our method utilizes the length of normal patterns to construct a graph, ensuring that the variation in anomaly length ℓ does not affect its performance.

Our contributions in this paper can be summarized as follows.

We develop a novel 2D time series visualization (2Dviz) method, which can compact all patterns on one-dimensional time series into a 2D spatial-temporal space, where time series patterns and anomalies are mapped in a higher-resolution 2Dviz plot for much easier subsequence anomaly recognition.
We propose a novel GraphTS method for domain agnostic subsequence anomaly detection. In GraphTS, we provide a new time series graph representation model, which represents both recurrent and rare patterns in time series. GraphTS can effectively detect recurrent and single subsequence anomalies in an unsupervised way (without knowing the length and the total number of anomalies) and can be used to discover variable-length subsequence anomalies.
We demonstrate the accuracy and efficiency of GraphTS by comparing to two state-of-the-art methods (i.e., STOMP and Series2Graph) on real-world time series datasets containing single and multiple recurrent subsequence anomalies.

The remainder of this paper is organized as follows. Section 2 presents the related work. In section 3, we define the problem. We detail the GraphTS method for subsequence anomaly detection using graph representation in section 4. Section 5 reports the experimental results over various real datasets, and section 6 concludes this paper.

2 Related work

As our focus in this paper is to detect anomalous subsequences in time series based on a graph representation, we present a brief review of subsequence anomaly detection methods and of time series graph representations for anomaly detection.

A. Subsequence anomaly detection. The problem of detecting subsequence anomalies from time series has been studied in several works based on the discord definition [4, 23–28]. In these methods, anomalous subsequences (discord) are identified based on their distances to all other subsequences in the time series. Specifically, the subsequence with the largest Euclidean distance to its nearest neighbors is considered as a discord or anomaly. These discord discovery algorithms can be applied either on original raw values [4, 25, 28] or on a representation of the subsequences such as Symbolic aggregate approximation (SAX) [23, 24] or Haar wavelets [26, 27].

HOTSAX is developed in [29] based on the SAX representation to detect the time series discord. SAX is a method for converting time series data into a symbolic representation. It symbolizes a subsequence by the mean of each of the subsequence’s segments; however, due to dimension reduction, it may omit crucial patterns in the subsequence. Several algorithms have been developed to improve the SAX representation of time series [30, 31]. Extended SAX (ESAX) is developed based on SAX in [30] by adding two extra points, the min and max points, to each subsequence segment’s mean value. Trend Distance (SAX-TD) [31] integrates the SAX distance with a weighted trend distance, which it computes using the segments’ starting and ending points. Although both ESAX and SAX-TD improve the original SAX representation, they may still lose important time series pattern information due to dimension reduction. Senin et al. [16, 23] developed the GrammarViz method, which detects time series discords based on grammar compression of time series discretized with SAX. Subsequences corresponding to rare grammar rules are considered discords, as their SAX symbols are not compressible and are most likely to be rare patterns. The performance of these SAX-based discord discovery algorithms is heavily reliant on the quality of the SAX data representation. The SAX method has three user-defined parameters that need to be carefully set in order to get a proper representation. However, selecting and fine-tuning these parameters is generally not trivial.

Recently, a matrix-profile-based approach, STOMP [21], was developed; it computes the matrix profile, which allows discovering the top-1 discord. STOMP provides a fast calculation of the distance of each subsequence to its nearest non-self neighbor. The main advantage of the above methods is their simplicity. However, these methods are not able to detect repeated anomalies with the same shape in different instances. To resolve the problem of detecting recurrent (repeated) anomalies by discovering discords, the notion of the m^th discord is proposed in [20], known as DRAG, in which a subsequence is a discord if it has the largest Euclidean distance to its m^th nearest neighbor. The DRAG algorithm is separated into two steps. The initial step involves selecting potential discord sequences by identifying those whose distances to their nearest neighbors fall within a certain range. The second step, known as refinement, is then used to determine the precise discord sequences from among the candidates identified in the first step. Nakamura et al. [32] recently developed the MERLIN method based on DRAG, which can scan all discords within a specified length range. MERLIN finds discords of all lengths by calling the fixed-length DRAG method with every length within the specified range. Although the m^th discord definition addresses the limitation of the simple discord, m is a user-defined parameter and is challenging to set. The performance of the method is also very sensitive to the chosen length, and setting a value larger or smaller than the correct one can lead to false positive results. In contrast to these methods, our proposed method identifies abnormal subsequences based on a new definition. We define abnormal sequences as those with the lowest path weight in the graph representing the time series. Our graph model can capture both single and recurrent anomalies and normal patterns in time series.
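The difference between the first-nearest-neighbor discord and the m^th-nearest-neighbor discord described above can be illustrated on toy data. The following is a minimal brute-force sketch, not the STOMP or DRAG algorithm; the function names, the z-normalized Euclidean distance, and the rule that overlapping windows are skipped are assumptions made for illustration.

```python
import math

def znorm(x):
    """Z-normalize a window; a tiny epsilon guards against zero variance."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / (sd + 1e-12) for v in x]

def mth_discord_scores(series, length, m=1):
    """Distance from every window to its m-th nearest non-overlapping neighbor.

    Brute force, O(n^2) distance computations; illustration only.
    """
    windows = [znorm(series[i:i + length])
               for i in range(len(series) - length + 1)]
    scores = []
    for i, a in enumerate(windows):
        dists = sorted(
            math.dist(a, b)
            for j, b in enumerate(windows)
            if abs(i - j) >= length  # skip trivial (overlapping) matches
        )
        scores.append(dists[m - 1])
    return scores

# Toy "twin freak" series: a periodic signal with the same ramp inserted twice.
base = [math.sin(2 * math.pi * k / 20) for k in range(20)]
series = [base[i % 20] for i in range(400)]
ramp = [-1 + 2 * k / 19 for k in range(20)]
series[100:120] = ramp
series[300:320] = ramp

first_nn = mth_discord_scores(series, 20, m=1)   # each anomaly matches its twin
second_nn = mth_discord_scores(series, 20, m=2)  # the twin no longer hides it
```

In this toy series every window, including the two anomalous ones, has an exact copy elsewhere, so the first-nearest-neighbor scores carry no signal, while the second-nearest-neighbor scores peak around the inserted ramps. The example also shows why m must match the number of repetitions, which is generally unknown.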

B. Time series graph representation for anomaly detection. Phase space reconstruction is an effective method used to analyze non-linear time series. It involves transforming a time series [33, 34] into a set of vectors, which are then used for constructing complex networks [35]. The complex network captures the underlying complex and irregular structure of the time series and can be analyzed by quantifying graph features, such as node degree distributions and path lengths. Time series analysis using complex networks has been applied to various domains [36, 37]. In [37], a new method was developed to map EEG time series to a complex network. Then, the network is used to extract sudden fluctuations (anomalies) in EEG time series. Some graph-based outlier detection methods map a time series to a graph by discovering relationships [38]. A method based on a time series graph representation is developed in [39] to detect outliers. The authors apply a sliding window and calculate the distance between subsequences. Then, they use each subsequence as a node, with the distances as edge weights, and develop a graph-based node clustering model to detect outliers. Although these techniques are similar to our proposed method in that they apply a graph representation to the problem, our graph representation of time series is more compact and better organized. Our method considers a group of subsequences as a node, while the above methods consider each subsequence as a node. A graph-based method, Series2Graph, is developed in [22] to discover both single and recurrent anomalies; it is the work most related to our method.

Our GraphTS method comprises two innovative techniques when compared to Series2Graph. First, in our GraphTS, a 2Dviz method is developed to map normalized subsequences into a 2D space by keeping the structural similarities between subsequences; that is, subsequences with similar shapes are projected into the same area in the 2D space. In contrast, in Series2Graph, the local convolution of subsequences is mapped into 2D space to enhance the wave shapes by reducing noise. Second, in our GraphTS, the 2Dviz plot space is divided into grid cells, and each cell is considered as a node to represent a group of sequences with the same shapes (i.e., structure/wave patterns). As a result, the number of nodes in the represented graph is kept below a certain number, enabling our method to be scalable and efficient when applied to large datasets. In contrast, Series2Graph considers the most crossed areas in the 2D space, through which repeated subsequences most likely pass, as nodes. The number of nodes in a graph built with Series2Graph increases with the size of the time series data, which makes the graph more complex for long time series.

3 The problem modeling

In this paper, we introduce a novel approach to transform a time series into a graph representation. Here, each subsequence is assigned a score, transforming the problem of abnormal subsequence detection into the task of identifying subsequences with lower scores. We formally model the problem as follows.

First, we aim to detect abnormal subsequences (i.e., anomalous patterns in local regions of a time series).

Definition 1 (Time series) A time series $T \in R^{n}$ is a sequence of real-valued numbers $t_{i} \in R$, $T = [t_{1}, t_{2}, \ldots, t_{n}]$, where $n = |T|$ is the length of T [22].

Definition 2 (Subsequence) A subsequence $T_{i, ℓ} \in R^{ℓ}$ of a time series T is a contiguous subset of the values of T of length ℓ starting at position i; formally, $T_{i, ℓ} = [t_{i}, t_{i + 1}, \ldots, t_{i + ℓ - 1}]$ [22].

Then, we borrow the idea of a directed graph (Definition 3) to define a new concept, time series graph representation, in Definition 4.

Definition 3 (A directed graph) A directed graph G is a pair (V,E), where V is a finite set of nodes, and E is a finite set of edges [22]. The elements of E are ordered pairs of nodes, each associated with a weight in the edge weight set W.

Definition 4 (Time series graph representation, G) Given a time series T, G(V, E) is a directed graph that represents both recurrent and rare patterns in T. G consists of a node set V = [v₁, v₂, …, v_m], an edge set E ⊆ {(v_i, v_j) ∣ v_i, v_j ∈ V}, and an edge weight set W ⊆ {w_ij(v_i, v_j) ∣ v_i, v_j ∈ V} [22].

Our vision is to represent a time series, T, as a directed graph, G(V, E), which characterizes both normal and abnormal patterns in the time series. In G, node set V represents the various subsequence patterns in the time series, and edge set E represents the successive occurrences of these patterns. Edge weight w_ij(v_i, v_j) is the number of successive occurrences of patterns v_i and v_j. Therefore, recurrent consecutive normal patterns and rare abnormal patterns in the time series can be represented by paths in G that are composed of high weighted edges and low weighted edges, respectively. This relies on the assumption that abnormal patterns are less frequent than normal patterns in the time series. Thus, we develop a new concept, the subsequence score, which is calculated by Eq 1 in Definition 5 as a function of the subsequence's representation path in G and can be used to rank the subsequences.

Definition 5 (Subsequence score) Let G(V, E) be the graph representation of a time series T, and let the subsequences of length ℓ in T have representation paths $P_{ℓ} = \{P_{ℓ}(i) = \langle v^{i+1}, v^{i+2}, \ldots, v^{i+ℓ} \rangle, v \in V, i \in [0, n - ℓ + 1]\}$ in G, where $P_{ℓ}(i)$ is the path between the nodes $v^{i}$ and $v^{i+ℓ}$. Then, we develop Eq 1 to calculate the subsequence score as follows:

$$\mathrm{score}(P_{ℓ}(i)) = \frac{\sum_{k = i}^{i + ℓ - 1} w(v^{k}, v^{k + 1})}{ℓ} \tag{1}$$

where $w(v^{k}, v^{k+1})$ is the edge weight between nodes $v^{k}$ and $v^{k+1}$, and ℓ is the subsequence length.
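As a small worked instance of Eq 1, the sketch below averages the edge weights along a path given as a list of node ids. The function name and the dictionary layout are assumptions for illustration. The weights for nodes 4 and 5 reuse the counts reported for Fig 2(c2) in Section 4.2 (self-loops 6 and 9, connecting edge 4); the edges touching node 10 are hypothetical low-weight edges added to contrast a rare path with a frequent one.

```python
def path_score(path, weights):
    """Eq 1: sum of edge weights along the path divided by the number of edges."""
    edges = list(zip(path, path[1:]))  # consecutive node pairs (v^k, v^{k+1})
    return sum(weights.get(e, 0) for e in edges) / len(edges)

# Edge weights keyed by (source node, target node).
weights = {
    (4, 4): 6,    # self-loop of v4 (Fig 2(c2))
    (5, 5): 9,    # self-loop of v5 (Fig 2(c2))
    (4, 5): 4,    # edge v4 -> v5 (Fig 2(c2))
    (5, 10): 1,   # hypothetical rare transition
    (10, 10): 1,  # hypothetical rare self-loop
}

frequent = path_score([4, 4, 4, 5, 5, 5], weights)  # (6 + 6 + 4 + 9 + 9) / 5
rare = path_score([5, 10, 10], weights)             # (1 + 1) / 2
```

A path with ℓ + 1 nodes has ℓ edges, so dividing by the number of edges matches the division by ℓ in Eq 1. The frequent path scores 6.8 and the rare path 1.0, so the rare path would be ranked as more anomalous.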

Based on the above definitions, the problem of this paper is modeled as follows.

The Problem Modeling. Given a time series T and a subsequence length ℓ, we construct graph G(V, E) from T in an unsupervised way (without knowing the labels of the subsequences in T). Based on the graph G, we calculate subsequence scores (Definition 5) and change the problem of detecting anomalous subsequences in a time series into the problem of finding those subsequence paths in graph G that have a much lower score than recurrent normal subsequences.

Note that, by Definition 5, the scores of trivial matches [3], where subsequences largely overlap each other, are very close (e.g., score(P_ℓ(i)) and score(P_ℓ(i + 1)) for subsequences T_i,ℓ and T_i+1,ℓ are almost the same, as the two subsequences overlap and differ by only one point). To avoid reporting these trivial matches, we apply an “exclusion zone” of length ℓ before and after the location of a detected subsequence, and subsequences inside this zone are ignored. The symbols we use in this paper are defined in Table 1.

Table 1. Table of symbols.

Symbol	Description
T	a time series
∣T∣	cardinality of T
T_i,ℓ	subsequence of length ℓ starting at position i
w_g	input window length
ℓ	anomaly length
l_np	normal pattern length
Z	matrix of all extracted subsequences
Z_n	normalized subsequences matrix
2DSTS	reduced 2D matrix of Z_n
n_cell	number of grid cells
V	node set
E	edge set
W	edge weight set
G(V, E)	directed graph representing T
P_T	path of time series T in graph G
P_ℓ(i)	path of subsequence T_i,ℓ in P_T

4 The proposed approach

In this section, we provide an overview of the GraphTS method for subsequence anomaly detection in Table 2, which summarizes all steps in our approach to detect subsequence anomalies using a graph representation of time series.

Table 2. Overview of the proposed method.

The GraphTS Method.
input: time series T, anomaly length ℓ, input window length w_g
output: subsequence anomalies
Step 1 2D visualization of time series (Algorithm 1 2Dviz). Transfer all subsequences of length w_g in T into a 2D spatial-temporal space, where subsequences with similar patterns are projected into similar spatial locations;
Step 2 Construction of graph (Algorithm 2 ConGraph). Construct a directed graph based on the 2D spatial-temporal space where spatial information is used to create the node set and temporal information is used to extract the edge set. The nodes represent the various subsequence patterns of length w_g in time series and edges represent the number of successive occurrences of these patterns;
Step 3 Subsequence anomaly detection (Algorithm 3 AnomalyScore). Calculate the abnormality score for each subsequence of length ℓ based on their path in the constructed graph and return a ranked list of abnormal subsequences in T.

In GraphTS, as shown in Table 2, the window length, w_g, is a user-defined parameter and is different from the length of an anomalous subsequence of interest, ℓ. We set w_g based on the length of normal patterns in the time series. In the experimental evaluation, we show that GraphTS is robust to different values of w_g when the selected value is close to the length of normal patterns in the time series. Although the length of the abnormal subsequence, ℓ, can be defined by users, the proposed method accurately detects anomalous subsequences under various values of ℓ. The remaining parameters in GraphTS are internal and can be set to default values. For example, the number of nodes in the graph, n_cell, is set to 100. Fig 2 illustrates the procedure of our proposed method. In the following subsections, we provide details of each step in the GraphTS method as shown in Table 2.

Fig 2 — (a) An example time series T extracted from MITBIH 1 dataset, with four anomalous subsequences (highlighted in red areas). (b) 2DSTS visualization of time series (step 1). (c) Graph construction (step 2). (d) Subsequence anomaly detection using subsequence score curves for all subsequences of T: low score indicates anomalous subsequences (step 3).

4.1 2D visualization of time series

We first develop the 2Dviz algorithm (Algorithm 1) for transferring a time series into a 2-dimensional spatial-temporal space (2DSTS), where the patterns of time series subsequences are preserved. We borrow the idea from [40] to develop the 2D Visualization method. However, our method is different from [40]; in [40], the whole time series is normalized using unity-based normalization to set its value into range [0, 1] while our method utilizes Z-normalization (i.e., normalizing every value in a dataset such that the mean of all of the values is 0 and the standard deviation is 1) for each subsequence of time series.

In Algorithm 1, the 2DSTS is obtained via three steps: (1) subsequence extraction, (2) subsequence normalization, and (3) dimension reduction. We first extract all subsequences of length w_g from T at Lines 1–2 in Algorithm 1, using a sliding window with a step of 1 point, and create matrix $Z \in R^{(|T| - w_{g} + 1) \times w_{g}}$, containing all subsequences $\{T_{i, w_{g}}, i \in [0, |T| - w_{g} + 1]\}$. Each row of matrix Z is a vector of size w_g, denoted Z[i, :], which corresponds to the extracted subsequence $T_{i, w_{g}}$. Then, at Line 3, we use Z-normalization to bring the mean of each subsequence to zero and its standard deviation to one, which enables comparison of the structural similarities of subsequences. This is done by subtracting the subsequence mean μ_i from each subsequence Z[i, :] and dividing the result by the subsequence standard deviation δ_i. We denote the normalized subsequence matrix as Z_n. The matrix Z_n is in a high-dimensional space; that is, each data point represents a normalized subsequence that occurs at a different time interval. To reduce the dimensionality of matrix Z_n to two dimensions, we apply Principal Component Analysis (PCA) at Line 4 and keep only the top two components in a reduced 2D matrix denoted as 2DSTS.

Algorithm 1 2Dviz

input: Time series T, input length w_g

output: 2D spatial-temporal space (2DSTS)

1. foreach i ∈ [0, ∣T∣ −w_g + 1] do

2. $Z [i, :] \leftarrow T_{i, w_{g}}$ ; ⊳ Subsequence extraction

3. $Z_{n} [i, :] \leftarrow \frac{Z [i, :] - μ_{i}}{δ_{i}}$ ; ⊳ Subsequence normalization

4. 2DSTS ← PCA.fit_transform(Z_n); ⊳ Dimension reduction
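The three steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `two_d_viz` is hypothetical, PCA is computed directly with a NumPy singular value decomposition instead of a library PCA class, and constant windows are left unscaled to avoid division by zero.

```python
import numpy as np

def two_d_viz(series, w_g):
    """Sketch of Algorithm 1: extract, z-normalize, and project windows to 2D."""
    t = np.asarray(series, dtype=float)
    # Subsequence extraction: one row per sliding window (step of 1 point).
    z = np.stack([t[i:i + w_g] for i in range(len(t) - w_g + 1)])
    # Subsequence normalization: zero mean, unit standard deviation per row.
    mu = z.mean(axis=1, keepdims=True)
    sd = z.std(axis=1, keepdims=True)
    z_n = (z - mu) / np.where(sd > 0, sd, 1.0)  # guard against constant windows
    # Dimension reduction: PCA via SVD of the column-centered matrix,
    # keeping the top two principal components.
    centered = z_n - z_n.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

# Toy usage: a sine wave with period 25 and windows of one full period.
t = np.sin(2 * np.pi * np.arange(200) / 25)
points = two_d_viz(t, 25)  # one 2D point per subsequence, in temporal order
```

Because each window is normalized on its own, windows with the same shape land on the same 2D point regardless of their offset or amplitude. In the toy series, windows that start one period apart coincide. The signs of the two components are arbitrary and may differ from those returned by other PCA implementations.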

Fig 3 depicts the 2DSTS for the example time series shown in Fig 2(a) with w_g = 80. Each data point represents a normalized subsequence (spatial representation), and links between two points indicate the temporal order of subsequences (temporal representation). It can be seen from Fig 3 that subsequences with the same patterns appear close to each other in the 2DSTS visualization (abnormal subsequences: T₁ − T₄; normal subsequences: T₅ − T₈). Because normal subsequences appear more frequently in the time series than abnormal ones, they create denser clusters in the 2DSTS visualization. Fig 4 shows the temporal patterns (trajectories) in the 2DSTS visualization for normal subsequences in Fig 4(b) and abnormal subsequences in Fig 4(c). The pattern (trajectory) difference between normal subsequences (N₁ − N₄) and abnormal subsequences (A₁ − A₄) is distinguishable in 2DSTS. We utilize these spatial and temporal characteristics of the 2DSTS visualization in constructing a graph.

Fig 4 — (a) The sample time series with four normal subsequences and four abnormal subsequences annotated as N₁ − N₄ and A₁ − A₄, respectively. (b) Normal trajectories correspond to temporal order of the normal subsequences. (c) Abnormal trajectories correspond to temporal order of the abnormal subsequences. The highlights in the time series with the corresponding highlights points in the 2DSTS space indicate noticeable differences in normal and abnormal trajectories.

4.2 Construction of graph

This step aims to create a directed graph G(V, E) based on the 2DSTS as shown in Algorithm 2 in order to extract abnormal and normal subsequence patterns. The main idea is to use spatial and temporal information in 2DSTS to create node set V and edge set E, respectively.

The ConGraph algorithm (Algorithm 2) consists of three steps: node creation (creating node set V), edge extraction (extracting edge set E), and graph construction (constructing the graph representation of the time series). In the node creation step, the 2DSTS space is divided into n_cell grid cells. Then, we consider each grid cell as a node v_i ∈ V, so all points in a cell (i.e., subsequences with similar patterns) are mapped to one node. As an example shown in Fig 2(c1), we divide the 2DSTS space into n_cell=25 grid cells and create a node set, V = [v₁, v₂, …, v₂₅].

In the edge extraction step, a directed link (v_i, v_j) is established from node v_i to node v_j if two consecutive points (subsequences) fall in these two cells of the 2DSTS, the first in v_i and the second in v_j. This process is applied to the entire 2DSTS matrix from the first point to the last point. The edge weight, w_ij, between two nodes, v_i and v_j, is the number of times two consecutive points occur between the two cells in 2DSTS. For example, in Fig 2(c2), two consecutive points occur between nodes v₄ and v₅ four times (that is, four links e_ij labeled as 1, 2, 3 and 4 in brown), so the edge weight between these two nodes is 4 (#e_ij). We also consider self-loops, where i = j, to map recurrent consecutive subsequences into a high weighted edge. A self-loop is established if two consecutive points appear in the same cell (node). The self-loop weight is the number of times that two consecutive points appear in the same cell (node), e.g., defined as #e_ii and #e_jj in Fig 2(c2). For instance, the self-loop weights for nodes v₄ and v₅ are 6 (#e_ii links) and 9 (#e_jj links), respectively, as shown in Fig 2(c2). In the graph construction step, the graph is constructed using node set V and edge set E.

Algorithm 2 ConGraph

input: 2-dimensional spatial-temporal space 2DSTS, n_cell=n_c × n_c

output: G(V, E)

⊳ Node creation

1. Node set (V): 2DSTS is divided into n_cell grid cells by dividing both dimensions X and Y into n_c intervals of width $s_{x} = \frac{x_{max} - x_{min}}{n_{c}}$ and $s_{y} = \frac{y_{max} - y_{min}}{n_{c}}$, respectively; each grid cell is represented by a node v_i ∈ V (as shown in Fig 2(c1)).

⊳ Edge extraction

2. Edge set (E): a directed link (v_i, v_j) is established from node v_i to node v_j if two consecutive subsequences fall in these two nodes, and its edge weight equals the number of times this occurs (shown as e_ij in Fig 2(c2)). A self-loop link is also created when two consecutive subsequences appear in the same cell (shown as e_ii and e_jj in Fig 2(c2)).

⊳ Graph creation

3. Construct the graph G(V, E) using node set V and edge set E (as shown in Fig 2(c3)).
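The node creation and edge extraction steps of Algorithm 2 can be sketched as follows. This is a minimal illustration, assuming the 2DSTS is given as a sequence of (x, y) points in temporal order; the function name `con_graph`, the row-major node numbering, and the clamping of boundary points into the last cell are assumptions, not details stated in the paper.

```python
from collections import Counter

def con_graph(points, n_c):
    """Sketch of Algorithm 2: grid cells become nodes, transitions become edges."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Cell side lengths s_x and s_y; a zero extent falls back to 1.0
    # to avoid dividing by zero.
    s_x = (max(xs) - x_min) / n_c or 1.0
    s_y = (max(ys) - y_min) / n_c or 1.0

    def cell(x, y):
        # Clamp so that points on the upper boundary fall in the last row/column.
        col = min(int((x - x_min) / s_x), n_c - 1)
        row = min(int((y - y_min) / s_y), n_c - 1)
        return row * n_c + col  # node id in [0, n_c * n_c)

    path = [cell(x, y) for x, y in points]  # temporal order of visited nodes
    weights = Counter(zip(path, path[1:]))  # edge and self-loop counts
    return path, weights

# Toy usage on five points and a 2 x 2 grid.
path, weights = con_graph(
    [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 0.9), (0.0, 0.0)], n_c=2
)
```

The returned `path` is the sequence of nodes visited by consecutive subsequences, and `weights` maps each directed edge, including self-loops, to the number of times that transition occurs. Only non-empty transitions are stored, so the graph stays small even when n_cell is large.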

The steps of the ConGraph algorithm are also illustrated in Fig 2(c). Note that the creation of the graph only requires one parameter, the grid size n_cell=n_c × n_c. Increasing the value of n_cell increases the storage needed for the graph. Decreasing the value of n_cell speeds up graph creation but may result in loss of information regarding normal and abnormal patterns. In the experimental evaluation, we demonstrate how an optimal value of n_cell can be found. Fig 2(c3) shows a G(V, E) graph with a size of n_cell=25 (5×5). Nodes with high self-loop values correspond to normal patterns, while nodes with low self-loop values correspond to abnormal patterns.

4.3 Subsequence anomaly detection

In this subsection, we detail how we use the information in the generated graph, G(V, E), to calculate the anomaly score for each subsequence of length ℓ (T_i,ℓ) and detect abnormal subsequences as shown in Algorithm 3.

Algorithm 3 AnomalyScore

input: G(V, E), time series T, input length ℓ

output: abnormal subsequence

⊳ Transfer time series T = [t₀, t₁, …, t_n] to a path P_T = < v⁰, v¹, …, v^n−w_g+1 > in G(V, E)

1. P_T ← < >;

2. foreach i ∈ [0, n − w_g + 1] do

3. $v^{i} \in V \leftarrow T_{i, w_{g}}$ ;

4. add vⁱ in P_T;

⊳ Map each subsequence of length ℓ into path sequence P_T and calculate its score

5. foreach i ∈ [0, n − w_g − ℓ + 1] do

6. P_ℓ(i) = < vⁱ, vⁱ⁺¹, …, v^i+ℓ−1 >, v ∈ V ← T_i,ℓ;

7. $\mathrm{score}(P_{ℓ}(i)) \leftarrow \frac{\sum_{k = i}^{i + ℓ - 1} w(v^{k}, v^{k + 1})}{ℓ}$ ;

8. Score(i) ← movingAve(score, w_g);

9. Anomalies ← DetectAnomaly(Score, k);

In Algorithm 3, we first transfer time series T into a path, P_T, using the generated graph G (lines 1–4). The path, P_T, is a sequence of nodes, where each node, vⁱ ∈ V, represents a subsequence, $T_{i, w_{g}}$ , extracted from time series T. This is done by mapping all subsequences, $\{T_{i, w_{g}}, i \in [0, |T| - w_{g} + 1]\}$ , to their corresponding nodes in G.

Fig 5(a) shows the path sequence, P_T, for the sample time series in Fig 2(a). P_T shows that the abnormal paths (highlighted in red) are easily distinguishable from the normal path (highlighted in green). We are interested in finding abnormal sequences of length ℓ. Therefore, each subsequence of T with a length of ℓ (T_i,ℓ) is mapped into a path, P_ℓ(i), using path sequence P_T (lines 5 and 6). Fig 5(b) and 5(c) illustrate normal and abnormal path sequences corresponding to normal and abnormal subsequences of length ℓ = 110, respectively. As mentioned before, the abnormal patterns in a time series are mapped into paths in G that have low-weighted edges, and the normal patterns are mapped into paths in G that have high-weighted edges (as shown in Fig 5(d) and 5(e)). For instance, the normal path sequence for the normal subsequence (starting at position i=950) is $P_{ℓ} (i = 950) = < v_{14}^{0}, v_{9}^{7}, v_{14}^{11}, v_{13}^{18}, v_{18}^{21}, v_{13}^{29}, v_{7}^{36}, v_{8}^{43}, v_{12}^{45}, v_{13}^{48}, v_{18}^{49}, v_{23}^{50}, v_{22}^{51}, v_{21}^{56}, v_{16}^{61}, v_{11}^{64}, v_{6}^{67}, v_{1}^{70}, v_{2}^{77}, v_{8}^{80}, v_{13}^{86}, v_{14}^{107} >$ (the superscript following each node denotes the first occurrence of the node in the path sequence) and its path in G is shown in Fig 5(d). In contrast, the abnormal path sequence for the abnormal subsequence (starting at position i=1945) is $P_{ℓ} (i = 1945) = < v_{2}^{0}, v_{3}^{5}, v_{4}^{9}, v_{5}^{11}, v_{10}^{15}, v_{15}^{19}, v_{20}^{22}, v_{25}^{28}, v_{24}^{33}, v_{19}^{34}, v_{20}^{43}, v_{15}^{49}, v_{14}^{70}, v_{13}^{90} >$ and its path in G is shown in Fig 5(e). The path weight, defined as the sum of the weights of the edges on a path, is much smaller for the abnormal path than for the normal path.

We use this information to calculate the anomaly score for each subsequence based on its path in G. The score for each subsequence is defined as the average path weight for each subsequence path of length ℓ in G and is calculated by dividing the path weight by path length ℓ (line 7). Then, a moving average filter is applied to the score vector to make sure that the score for highly overlapping subsequences has a relatively similar score (line 8). In the final step, we rank subsequences based on their scores (from the lowest to highest score) and report an anomaly list (rank, subsequence), which can be used to detect Top-K abnormal subsequences (line 9). The K subsequences of time series T with the lowest score are considered abnormal subsequences.
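The scoring in lines 7 and 8 of Algorithm 3 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the use of prefix sums, and the boundary handling of the moving average (averaging only over positions that exist) are assumptions, since the text does not specify how movingAve treats the ends of the score vector.

```python
import numpy as np


def path_scores(cells, edges, ell):
    """Average edge weight along each length-`ell` window of the node path."""
    # w[k] is the graph weight of the transition v^k -> v^(k+1).
    w = np.array(
        [edges[(a, b)] for a, b in zip(cells[:-1], cells[1:])], dtype=float
    )
    # Prefix sums turn every window sum into a single subtraction.
    prefix = np.concatenate(([0.0], np.cumsum(w)))
    return (prefix[ell:] - prefix[:-ell]) / ell


def moving_average(scores, window):
    """Smooth the score vector so overlapping subsequences score similarly."""
    kernel = np.ones(window)
    sums = np.convolve(scores, kernel, mode="same")
    # Number of real samples under the kernel at each position, so the
    # ends are not pulled towards zero by implicit padding.
    counts = np.convolve(np.ones(len(scores)), kernel, mode="same")
    return sums / counts
```

Here `cells` is the node path P_T as a sequence of node identifiers and `edges` maps a pair of nodes to its edge weight; low smoothed scores then indicate candidate anomalies.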

Fig 6(a) shows the subsequence scores, Score(i), for the sample time series in Fig 2(a). As expected, the score for abnormal paths is much lower than the score for the normal path. For instance, the scores for the normal path P_ℓ(i = 950) and the abnormal path P_ℓ(i = 1945) (shown in Fig 5(d) and 5(e)) are 465.29 and 227.77, respectively. We exclude trivial matches by incorporating an "exclusion zone" (shown as gray areas in Fig 6(a)) of length ℓ before and after the location of each lowest score to avoid reporting overlapping subsequences. With K=4 for detecting four abnormal subsequences in the sample time series, GraphTS reports the four subsequences with the lowest scores, starting at positions i=3685, 4435, 1945, and 2594, and ranks them by score as 1st, 2nd, 3rd and 4th, respectively. Fig 6(b) illustrates the four abnormal subsequences that were correctly detected by our GraphTS method. We note that selecting a value for parameter K is not strictly necessary, as our method can provide a ranked list of all subsequences. Moreover, the generated graph model can be used to discover anomalies of various lengths. From Algorithm 3 (lines 5–9), we only need to map subsequences with various lengths ℓ ∈ [minL, maxL] into path sequence P_T and calculate their scores to find anomalies of different lengths. Therefore, our proposed method can identify anomalies of different lengths much faster than methods that must be rerun for each length to detect variable-length anomalies.
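The Top-K selection with an exclusion zone described above can be sketched as below. This is an illustrative sketch only; the function name and the choice to mask excluded positions with infinity are assumptions, and the source's DetectAnomaly routine may differ in detail.

```python
import numpy as np


def top_k_anomalies(scores, k, ell):
    """Return start positions of the k lowest-scoring, non-overlapping windows."""
    scores = np.asarray(scores, dtype=float).copy()
    found = []
    for _ in range(k):
        if not np.isfinite(scores).any():
            break  # every position is already excluded
        i = int(np.argmin(scores))
        found.append(i)
        # Exclusion zone: mask `ell` positions before and after the hit
        # so that trivially overlapping subsequences are not reported.
        scores[max(0, i - ell): i + ell + 1] = np.inf
    return found
```

The returned list is ordered by rank, so its first element is the most anomalous subsequence; requesting more anomalies than the exclusion zones allow simply yields a shorter list.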

5 Experimental study

In this section, we have conducted extensive experiments to evaluate the accuracy and efficiency of our GraphTS method on various real-world datasets. We present the experimental setup in Section 5.1, discuss optimal parameters of GraphTS in Section 5.2, and evaluate its Top-K accuracy in Section 5.3 and efficiency (execution time) in Section 5.4.

5.1 Experimental setup

We have implemented our GraphTS in Python. The experiments were carried out on a computer with an Intel CORE i7–8650U CPU @ 1.90GHz and 16GB memory, running a 64-bit Windows 10 operating system. To ensure the reproducibility of our experiments, we have built a webpage (https://sites.google.com/view/graphts) with the source code and datasets.

We evaluated our proposed method using real datasets from various domains and UCR benchmark as shown in Tables 3 and 4. The datasets are listed as follows.

Table 3. Datasets used to evaluate the proposed method, with length of time series (n), length of anomaly (ℓ), number of anomalies (A) and domain.

Datasets		n	ℓ	A	Domain
1. SED		100 K	100	50	Electronic
2. MIT-BIH	MITBIH 1 (SAD803)	200 K	80	130	Cardiology
	MITBIH 2 (SAD820)	200 K	100	159	Cardiology
	MITBIH 3 (AD116)	200 K	200	32	Cardiology
	MITBIH 4 (AD119)	200 K	250	125	Cardiology

Table 4. List of the UCR benchmark datasets used to compare the performance of the proposed method with Series2Graph method.

Group name (GN), number of datasets in each group (#D), and File number considered in each group (#F).

GN		#D	#F
UCR datasets	ECG	25	109–111,119–126,163–166,178–180,182–183,192–196
	Internal bleeding (IB)	13	132–144
	Giat	12	127–131,167–169,170–172,181
	Insect	11	145–150,173–177
	Respiration	8	184–191
	CHARI	8	201–208
	Weather	6	113–118
	NASA	5	156–160
	Other	12	112,151–155,161–162,197–200

Simulated engine disks data set (SED) [41, 42], which contains disk revolutions time series collected at NASA Glenn Research Center’s Rotordynamics Laboratory.
MIT-BIH Supraventricular Arrhythmia Database (svdb) and Arrhythmia Database (mitdb) [43, 44], which consist of four electrocardiogram recordings with different arrhythmia (heart anomalies).
UCR Time Series Anomaly Datasets: we evaluate our proposed method on a recently published benchmark, the UCR Time Series Anomaly Datasets [45]. We use 100 time series (file numbers #109 to #208) from various domains and group them by application domain, as shown in Table 4. Each dataset in the UCR benchmark has a training part and a testing part. The training part is free of anomalies, while the testing part has only one anomaly, or one significant anomaly if it has more than one.

Our evaluation strategy has three steps: (1) in Section 5.2, we study the sensitivity of the proposed method to its parameters and provide optimal values for them; (2) in Sections 5.3–5.4, we evaluate the robustness of GraphTS for anomaly detection in terms of Top-K accuracy and execution time using real datasets and compare it with the STOMP method [21] and Series2Graph [22]; and (3) in Section 5.5, we evaluate the ability of GraphTS to detect anomalies of various lengths using 100 time series from the UCR benchmark datasets and compare it with Series2Graph [22] in terms of Top-1 accuracy and execution time.

5.2 Optimal parameters of the proposed method

In this subsection, we evaluate the sensitivity of the GraphTS method and discuss the optimal values of its three parameters: w_g, n_cell and ℓ.

(1) Effect of w_g

We first evaluate the effect of the input parameter w_g of the proposed method. As this parameter is used to generate graph G, we need to ensure that GraphTS is robust to its variation so that the graph accurately represents the patterns (normal and abnormal) in the time series; this is critical for detecting anomalies accurately. Because a time series may not contain any anomaly, we set w_g based on the length of the normal pattern (l_np) in the time series to guarantee that the generated graph can characterize the normal pattern. To evaluate the sensitivity of GraphTS to w_g, we measure Top-k accuracy, setting k and ℓ equal to the number of anomalies, A, and the length of anomalies, respectively, in each real dataset, letting n_cell = 100, and then varying w_g relative to l_np in each dataset. The l_np length can be identified easily from each dataset using the Multi-Window-Finder method [46]. For example, l_np for the ECG dataset is the heartbeat's length. Table 5 shows the l_np values for each dataset. Fig 7 shows the stability of the proposed method as w_g varies. The performance is stable when the w_g used to create the graph is smaller than l_np, which suggests that selecting w_g smaller than l_np leads to better accuracy. Therefore, we set w_g = l_np − 20 for the rest of the experiments.

Table 5. Selected l_np values for the SED and MITBIH datasets.

Dataset	SED	MITBIH 1	MITBIH 2	MITBIH 3	MITBIH 4
l_np	100	100	110	270	320

(2) Effect of ℓ

We evaluate the robustness of GraphTS to variation in the subsequence length ℓ. We measure Top-k accuracy, setting k equal to the number of anomalies (A) in each real dataset and w_g equal to l_np − 20, and then vary ℓ. We use n_cell = 100 for this experiment. Fig 8 shows the Top-k accuracy of the proposed method as ℓ varies. These results indicate that we can identify anomalies with high accuracy across different values of ℓ using a fixed w_g. Therefore, the proposed method is robust to the anomaly length and does not need to know the exact length of the anomaly.

(3) Effect of n_cell

We evaluate the influence of the number of nodes, n_cell, on the accuracy of GraphTS and on the execution time of graph generation. We measure Top-k accuracy, setting k equal to the number of anomalies (A), w_g equal to l_np − 20, and ℓ equal to the length of anomalies in each dataset, and then vary n_cell from 4 to 400. Fig 9(a) illustrates how the Top-k accuracy changes with n_cell. Although the performance of the proposed method drops for a small number of cells (n_cell < 100), it remains stable when n_cell ≥ 100. This indicates that satisfying n_cell ≥ 100 is sufficient to yield satisfactory results. Fig 9(b) shows the execution time for graph generation versus the number of cells (n_cell). Increasing n_cell increases the graph size. Decreasing n_cell speeds up graph creation slightly, but it may result in information loss regarding the normal and abnormal patterns. To avoid information loss and keep GraphTS scalable on large datasets, we select n_cell = 100 for the rest of the experiments.

5.3 Top-k accuracy

In this section, we report the evaluation result based on Top-K Accuracy on both SED and MIT-BIH datasets. We compare the proposed method to the counterpart STOMP and Series2Graph methods.

We evaluate the ability of the proposed method to correctly detect k abnormal subsequences in real datasets (MIT-BIH and SED). For GraphTS, we set w_g = l_np − 20, ℓ equal to the length of the anomaly, and n_cell = 100, and retrieve the Top-k anomalous subsequences. For Series2Graph, we use the same parameter values selected in [22], pattern length l_p = 50 and query length l_q = 75, for datasets SED, MITBIH 1 and MITBIH 2. For datasets MITBIH 3 and MITBIH 4, we set l_q equal to the length of the anomaly and the pattern length to l_p = 2l_q/3. Fig 10 shows the Top-k accuracy of GraphTS, STOMP and Series2Graph for each dataset. These real datasets contain multiple anomalies. The proposed method achieves perfect or near-perfect accuracy and outperforms both STOMP and Series2Graph in total accuracy. Both GraphTS and Series2Graph achieve perfect accuracy on SED, MITBIH 1, MITBIH 3 and MITBIH 4. However, GraphTS performs much better on MITBIH 2, obtaining an accuracy of 96% while Series2Graph achieves 67%. The STOMP method achieves lower accuracy because the anomalies in these datasets are not uncommon subsequences: abnormal sequences with a similar pattern are repeated. These results indicate the ability of GraphTS to accurately detect recurrent anomalies.

5.4 Efficiency

In this subsection, we report the efficiency of the proposed method on real datasets (MIT-BIH and SED) and compare it with two counterpart methods (STOMP and Series2Graph). The execution times for different dataset sizes are shown in Fig 11. We use several prefix snippets of the MITBIH datasets (20K, 50K, 100K, 150K and 200K points) and of the SED dataset (20K, 40K, 60K, 80K and 100K points). For all datasets, we set k equal to the number of anomalies in each snippet. The results show that the proposed method is faster than STOMP and Series2Graph: GraphTS is at least two orders of magnitude faster than Series2Graph and four orders of magnitude faster than the STOMP method. To show the scalability of GraphTS, we also report the number of nodes in each represented graph generated by GraphTS and Series2Graph, with the aim of demonstrating the memory and time efficiency of GraphTS. Note that STOMP is excluded from this comparison because it is not a graph-based method. Fig 12 shows how the number of nodes in each represented graph changes with dataset size. As we set n_cell equal to 100, the number of nodes in every represented graph is at most 100, and the size of the time series does not affect the size of the graph built by GraphTS. However, because Series2Graph treats the most crossed areas in the 2D space as nodes, the number of nodes in the graphs built by Series2Graph increases with the time series size, as shown in Fig 12. This result indicates that GraphTS can represent long time series without increasing the number of nodes in the represented graph. Therefore, it is more time and memory efficient than the Series2Graph method.

5.5 Variable-length anomaly detection

In this section, we evaluate the ability of the proposed method to detect anomalies of various lengths. We report the effectiveness of the proposed method in detecting abnormal subsequences in datasets from different domains using the UCR benchmark datasets. We also compare our method with Series2Graph in terms of accuracy and execution time.

For each time series in the UCR benchmark datasets, we focus only on the testing part of each signal and aim to identify the Top-1 anomaly, because there is only one anomaly present in each testing time series. For GraphTS, we set n_cell = 100 for all datasets, and w_g is set to be smaller than l_np in each dataset to build the GraphTS model (see S1 Table for the selected values of w_g and ℓ for each dataset). For Series2Graph, we use the same value of w_g as the pattern length to build the Series2Graph model. Both the generated GraphTS and Series2Graph models can be used to discover anomalies of various lengths. Therefore, we report the Top-1 anomaly result for different lengths in a given range from minimum to maximum length (ℓ ∈ [minL, maxL]) in each dataset. The minimum length (minL) is set to 10 and the maximum length (maxL) is set to max(100, l_np) for each dataset. As each dataset contains only one anomaly, we consider the Top-1 discord for all methods and report the result as binary (detected ∣ not-detected), where a method counts as having detected the anomaly if it locates it correctly for any ℓ ∈ [minL, maxL]. Then, we calculate the accuracy by dividing the number of correct detections of each method by the total number of datasets (#D) in each group, as shown in Table 4. The anomaly detection results of each method and its execution time for each dataset are provided in S1 Table.
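The evaluation protocol above can be sketched as follows. This is a hypothetical illustration: the text does not state what counts as locating an anomaly "correctly", so the overlap criterion below, the function names, and the omission of the moving-average smoothing are all assumptions. The sketch also shows why a single graph suffices for all lengths, since every ℓ reuses the same prefix sums of the edge weights along P_T.

```python
import numpy as np


def detected_for_any_length(edge_weights, anomaly_start, anomaly_end,
                            min_len, max_len):
    """True if the Top-1 window overlaps the labeled anomaly for any length."""
    w = np.asarray(edge_weights, dtype=float)
    prefix = np.concatenate(([0.0], np.cumsum(w)))  # shared by all lengths
    for ell in range(min_len, min(max_len, len(w)) + 1):
        scores = (prefix[ell:] - prefix[:-ell]) / ell
        start = int(np.argmin(scores))  # Top-1 candidate for this length
        # Assumed hit criterion: the window [start, start + ell) overlaps
        # the labeled range [anomaly_start, anomaly_end).
        if start < anomaly_end and anomaly_start < start + ell:
            return True
    return False


def group_accuracy(detections):
    """Fraction of datasets in a group whose anomaly was detected."""
    return sum(detections) / len(detections)
```

For example, a group in which three of four datasets are marked as detected would have an accuracy of 0.75 under this definition.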

Fig 13 summarizes the performance comparison of GraphTS with the Series2Graph method in terms of accuracy and execution time. Fig 13(a) confirms that GraphTS outperforms Series2Graph in terms of anomaly detection in seven groups of the UCR datasets and achieves a total accuracy of 84%, while Series2Graph obtains a total accuracy of 61%. GraphTS also outperforms Series2Graph in terms of execution time in all groups, as shown in Fig 13(b). The total execution time results indicate that the proposed method is much faster than the counterpart Series2Graph method: GraphTS needs only 2319s to process all datasets, which is about 30 times faster than Series2Graph. In all datasets (see S1 Table), the execution times of the proposed GraphTS method are much lower than those of the counterpart method. These results show the ability of GraphTS to identify anomalies of variable lengths using the same generated graph model.

In summary, the experimental results show that the proposed GraphTS outperforms the counterpart Series2Graph and STOMP methods in terms of accuracy and execution time. GraphTS has several merits, as follows:

1) The GraphTS does not need labels for subsequences to generate the graph model and can be applied in different domains.
2) The GraphTS is robust to the variation of anomaly length ℓ and does not need any information about the anomalous subsequence.
3) The GraphTS is able to correctly detect recurrent anomalies.
4) The GraphTS is much faster than the counterpart methods. The generated graph can be used to detect anomalies with different lengths.

A limitation of the current study is that the proposed method still requires the length of normal patterns in the time series to build the represented graph. However, this length can be identified from the time series using methods such as Multi-Window-Finder [46]. In future work, we aim to enhance GraphTS by incorporating additional features. First, we plan to explore different time series segmentation methods, such as Multi-Window-Finder [46], to automate the process of identifying window sizes based on various behaviours in the time series and setting the input parameter w_g in GraphTS. By leveraging these methods, we can improve the efficiency and adaptability of the anomaly detection process. Furthermore, we intend to leverage the represented graph in GraphTS for time series motif discovery. Building upon the generated path sequence of the time series, we will employ techniques to identify repeated node sequences that represent normal recurring patterns (motifs) in the time series data. Capturing these recurring motifs can provide valuable insights into the underlying patterns and structures within the time series. By incorporating these advancements, we aim to further enhance the effectiveness and versatility of GraphTS, making it a more robust and comprehensive tool for anomaly detection and pattern discovery in time series data.

6 Conclusion

In this paper, we have presented GraphTS, a novel subsequence anomaly detection method designed to address the limitations of existing models. Our approach overcomes the challenges of needing prior knowledge of anomaly length and quantity, as well as the inability to detect recurrent anomalies. By leveraging a graph representation of time series data, GraphTS enables the efficient detection of both rare and frequent subsequence anomalies across diverse domains. The method involves embedding time series subsequences into a 2D space using our developed 2D visualization technique, followed by constructing a graph based on this representation. Notably, GraphTS does not rely on labeled data for training and is capable of detecting anomalies of varying lengths using a single represented graph. Experimental results demonstrate that GraphTS outperforms comparable methods such as STOMP and Series2Graph in terms of accuracy and execution time.

Supporting information

S1 Table. Performance comparisons with the Series2Graph methods on UCR datasets.

We show the selected parameters, the anomaly detection results of each method, and its execution time for each dataset in the UCR archive. Green cells and red cells designate datasets where a method detects or does not detect the anomaly, respectively.

(PDF)


Data Availability

The data sets used in the evaluation section are publicly available: the MIT-BIH Supraventricular Arrhythmia Database (svdb) and Arrhythmia Database (mitdb) are available from the following repositories (https://physionet.org/content/svdb/1.0.0/, https://physionet.org/content/mitdb/1.0.0/). The simulated engine disks dataset (SED) is available from https://data.nasa.gov/dataset/Rotor-health-monitoring-combining-spin-tests-and-d/rbn3-kay3.

Funding Statement

This work was partially supported by an Australian Research Council (ARC) Discovery Project (DP190100587, https://www.arc.gov.au/, GH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Zarei R, He J, Huang G, Zhang Y. Effective and efficient detection of premature ventricular contractions based on variation of principal directions. Digital Signal Processing. 2016;50:93–102. doi: 10.1016/j.dsp.2015.12.002
2. Feng Y, Cai W, Yue H, Xu J, Lin Y, Chen J, et al. An improved X-means and isolation forest based methodology for network traffic anomaly detection. PLoS One. 2022;17(1):e0263423. doi: 10.1371/journal.pone.0263423
3. Chiu B, Keogh E, Lonardi S. Probabilistic discovery of time series motifs. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining; 2003. p. 493–498.
4. Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, et al. Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE; 2016. p. 1317–1322.
5. Linardi M, Zhu Y, Palpanas T, Keogh E. Matrix profile goes MAD: variable-length motif and discord discovery in data series. Data Mining and Knowledge Discovery. 2020;34:1022–1071. doi: 10.1007/s10618-020-00685-w
6. Yoshihara K, Takahashi K. A simple method for unsupervised anomaly detection: An application to Web time series data. PLoS One. 2022;17(1):e0262463. doi: 10.1371/journal.pone.0262463
7. Guo A, Smith S, Khan YM, Langabeer JR II, Foraker RE. Application of a time-series deep learning model to predict cardiac dysrhythmias in electronic health records. PLoS One. 2021;16(9):e0239007. doi: 10.1371/journal.pone.0239007
8. Shaw P, Barr JR, Abu-Khzam FN. Anomaly detection via correlation clustering. In: 2022 IEEE 16th International Conference on Semantic Computing (ICSC). IEEE; 2022. p. 307–313.
9. Abbas N, Nasser Y, Shehab M, Sharafeddine S. Attack-specific feature selection for anomaly detection in software-defined networks. In: 2021 3rd IEEE middle east and north Africa communications conference (menacomm). IEEE; 2021. p. 142–146.
10. Gupta M, Gao J, Aggarwal C, Han J. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery. 2014;5(1):1–129. doi: 10.1007/978-3-031-01905-0_1
11. Boniol P, Linardi M, Roncallo F, Palpanas T. SAD: an unsupervised system for subsequence anomaly detection. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE; 2020. p. 1778–1781.
12. Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T. Coconut: sortable summarizations for scalable indexes over static and streaming data series. The VLDB Journal. 2019;28(6):847–869. doi: 10.1007/s00778-019-00573-w
13. Hadjem M, Naït-Abdesselam F, Khokhar A. ST-segment and T-wave anomalies prediction in an ECG data using RUSBoost. In: 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom). IEEE; 2016. p. 1–6.
14. Zarei R, He J, Siuly S, Huang G, Zhang Y. Exploring Douglas-Peucker algorithm in the detection of epileptic seizure from multicategory EEG signals. Hindawi; 2019.
15. Judith AM, Priya SB, Mahendran RK, Gadekallu TR, Ambati LS. Two-phase classification: ANN and A-SVM classifiers on motor imagery BCI. Asian Journal of Control. 2022.
16. Senin P, Lin J, Wang X, Oates T, Gandhi S, Boedihardjo AP, et al. Time series anomaly discovery with grammar-based compression. In: EDBT; 2015. p. 481–492.
17. Rasheed F, Alhajj R. A framework for periodic outlier pattern detection in time-series sequences. IEEE Transactions on Cybernetics. 2013;44(5):569–582. doi: 10.1109/TSMCC.2013.2261984
18. Yang J, Wang W, Yu PS. Infominer: mining surprising periodic patterns. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining; 2001. p. 395–400.
19. Wei L, Keogh E, Xi X. Saxually explicit images: Finding unusual shapes. In: Sixth International Conference on Data Mining (ICDM’06). IEEE; 2006. p. 711–720.
20. Yankov D, Keogh E, Rebbapragada U. Disk aware discord discovery: Finding unusual time series in terabyte sized datasets. Knowledge and Information Systems. 2008;17(2):241–262. doi: 10.1007/s10115-008-0131-9
21. Zhu Y, Zimmerman Z, Senobari NS, Yeh CCM, Funning G, Mueen A, et al. Matrix profile II: Exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE; 2016. p. 739–748.
22. Boniol P, Palpanas T. Series2graph: Graph-based subsequence anomaly detection for time series. Proceedings of the VLDB Endowment. 2020;13(12):1821–1834. doi: 10.14778/3415478.3415514
23. Senin P, Lin J, Wang X, Oates T, Gandhi S, Boedihardjo AP, et al. Grammarviz 3.0: Interactive discovery of variable-length time series patterns. ACM Transactions on Knowledge Discovery from Data (TKDD). 2018;12(1):1–28. doi: 10.1145/3051126
24. Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, Handley J. Compression-based data mining of sequential data. Data Mining and Knowledge Discovery. 2007;14(1):99–129. doi: 10.1007/s10618-006-0049-3
25. Liu Y, Chen X, Wang F, Yin J. Efficient detection of discords for time series stream. In: Advances in Data and Web Management. Springer; 2009. p. 629–634.
26. Fu AWC, Leung OTW, Keogh E, Lin J. Finding time series discords based on haar transform. In: International Conference on Advanced Data Mining and Applications. Springer; 2006. p. 31–41.
27. Bu Y, Leung TW, Fu AWC, Keogh E, Pei J, Meshkin S. Wat: Finding top-k discords in time series database. In: Proceedings of the 2007 SIAM International Conference on Data Mining. SIAM; 2007. p. 449–454.
28. Luo W, Gallagher M. Faster and parameter-free discord search in quasi-periodic time series. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer; 2011. p. 135–148.
29. Keogh E, Lin J, Fu A. Hot sax: Efficiently finding the most unusual time series subsequence. In: Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE; 2005. p. 8–pp.
30. Lkhagva B, Suzuki Y, Kawagoe K. New time series data representation ESAX for financial applications. In: 22nd International Conference on Data Engineering Workshops (ICDEW’06). IEEE; 2006. p. x115–x115.
31. Sun Y, Li J, Liu J, Sun B, Chow C. An improvement of symbolic aggregate approximation distance measure for time series. Neurocomputing. 2014;138:189–198. doi: 10.1016/j.neucom.2014.01.045
32. Nakamura T, Imamura M, Mercer R, Keogh E. MERLIN: Parameter-Free Discovery of Arbitrary Length Anomalies in Massive Time Series Archives. In: 2020 IEEE International Conference on Data Mining (ICDM). IEEE; 2020. p. 1190–1195.
33. Yang Z, Fan D, Wang Q, Luan G. Sharp decrease in the Laplacian matrix rank of phase-space graphs: a potential biomarker in epilepsy. Cognitive Neurodynamics. 2021; p. 1–11.
34. Jiang Y, Bao X, Hao S, Zhao H, Li X, Wu X. Monthly Streamflow Forecasting Using ELM-IPSO Based on Phase Space Reconstruction. Water Resources Management. 2020;34(11):3515–3531. doi: 10.1007/s11269-020-02631-3
35. Marwan N, Donges JF, Zou Y, Donner RV, Kurths J. Complex network approach for recurrence analysis of time series. Physics Letters A. 2009;373(46):4246–4254. doi: 10.1016/j.physleta.2009.09.042
36. Scarsoglio S, Cazzato F, Ridolfi L. From time-series to complex networks: Application to the cerebrovascular flow patterns in atrial fibrillation. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2017;27(9):093107. doi: 10.1063/1.5003791
37. Supriya S, Siuly S, Wang H, Zhang Y. New feature extraction for automated detection of epileptic seizure using complex network framework. Applied Acoustics. 2021;180:108098. doi: 10.1016/j.apacoust.2021.108098
38. Li G, Jung JJ. Dynamic graph embedding for outlier detection on multiple meteorological time series. PLoS One. 2021;16(2):e0247119. doi: 10.1371/journal.pone.0247119
39. Farag A, Abdelkader H, Salem R. Parallel graph-based anomaly detection technique for sequential data. Journal of King Saud University-Computer and Information Sciences. 2019.
40. Ali M, Jones MW, Xie X, Williams M. TimeCluster: dimension reduction applied to temporal data for visual analytics. The Visual Computer. 2019;35(6):1013–1026. doi: 10.1007/s00371-019-01673-y
41. Abdul-Aziz A, Woike MR, Oza NC, Matthews BL, Lekki JD. Rotor health monitoring combining spin tests and data-driven anomaly detection methods. Structural Health Monitoring. 2012;11(1):3–12. doi: 10.1177/1475921710395811
42. Abdul-Aziz A, Woike M, Oza N, Matthews B, Baakilini G. Propulsion health monitoring of a turbine engine disk using spin test data. In: Health Monitoring of Structural and Biological Systems 2010. vol. 7650. International Society for Optics and Photonics; 2010. p. 76501B.
43. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet. Circulation. 2000;101(23):e215–e220. doi: 10.1161/01.CIR.101.23.e215
44. Moody GB, Mark RG. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine. 2001;20(3):45–50. doi: 10.1109/51.932724
45. Wu R, Keogh E. Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Transactions on Knowledge and Data Engineering. 2021. doi: 10.1109/TKDE.2021.3112126
46. Imani S, Abdoli A, Beyram A, Imani A, Keogh E. Multi-Window-Finder: Domain Agnostic Window Size for Time Series Data; 2021.

PLoS One. doi: 10.1371/journal.pone.0290092.r001

Decision Letter 0

Vijayalakshmi Kakulapati

2 Aug 2022

PONE-D-22-15666

GraphTS: Graph-Represented Time Series for Subsequence Anomaly Detection

PLOS ONE

Dear Dr. Huang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 11 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vijayalakshmi Kakulapati, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The above article currently has several weaknesses, which are described below.

1: the Related Work section is petite. Please add more work to this section and discuss it briefly.

2: Please provide a detailed description of your proposed model.

3: Specify the limitations and drawbacks of the proposed method.

4: It is recommended to re-examine and design the Figures.

5: A deep and detailed comparison with other methods is mandatory.

6: The authors should also clarify the motivation and main contribution of applied approach more clearly in the introduction and conclusion sections.

7: The results and discussion section has to be improved, where more details of the achieved results should be stated clearly in this section. In addition, authors also have to provide some insight discussion of the results.

Reviewer #2: This article proposes a graph-based anomaly detection method for time series data. The proposed method, GraphTS, aims to do efficient detection of both recurrent and rare anomalies.

This article is well-organized, well-explained, and easy to read. However, I have some concerns.

Concerns:

1. Image quality is very poor. For example, it was hard to determine what's going on in (c1) and (c2) in Figure 2. The edge labels in (C3) of the same figure is almost illegible. Please replace the current images with high quality ones. Figure 3 is so blurry that the points and links between points are hard to distinguish. Better quality images would be help to understand the method well.

2. It's confusing that Z has been represented as a vector in Algorithm 1 (but it's actually a 2D matrix)? The symbols and notations used in the algorithm should be defined clearly.

3. The authors claim, in section 5.3, that the proposed method outperforms both STOMP and Series2Graph methods. This doesn't really give an accurate picture of the comparison. Out of 5 datasets, the proposed method outperforms Series2Graph on only one and performs the same on the remaining datasets. The authors should revise their claim and make it more specific. The one dataset on which the proposed method performs better, it does 20% better than Series2Graph which is satisfactory. However, if we consider the overall performance comparison shown in Figure 10, the proposed method does not do significantly better than Series2Graph. It would be more appealing if the authors used more datasets and could show that their proposed method performs better on at least about 50% of the datasets used in the experiment.

However, considering the scalability factor, the proposed method does have some useful contributions.

Minor issues:

Some grammar and spelling errors exist. For example, it should be "rest", not "reset", in the paragraph under Table 1.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Aug 16;18(8):e0290092. doi: 10.1371/journal.pone.0290092.r002

Author response to Decision Letter 0

2 Feb 2023

Response to reviewers’ comments for: GraphTS: Graph-Represented Time Series for Subsequence Anomaly Detection

We thank the editor and all the reviewers for their constructive feedback, which has helped improve this paper. Please find below our point-by-point responses to the comments raised by the reviewers. We have highlighted the corresponding modifications in the revised version.

Response to Reviewer 1

Comment 1: the Related Work section is petite. Please add more work to this section and discuss it briefly.

Response:

As suggested, we have added more related works (i.e., [39] [42] [43] [44]) and discussions to the Related Work section (see page 4, lines 121-157).

Comment 2: Please provide a detailed description of your proposed model.

Response:

We have added a fuller overall description of the GraphTS method to the Introduction section of the revised version (see page 2, lines 46-60). Table 2 also summarizes the three steps of the GraphTS method, and Figure 2 provides an example that explains the three steps. Accordingly, Sections 4.1 (Algorithm 1), 4.2 (Algorithm 2) and 4.3 (Algorithm 3) explain the mechanism of each step in detail.

Comment 3: Specify the limitations and drawbacks of the proposed method.

Response:

We have described the limitations of the proposed method and the planned future work (see page 17, lines 565-571).

Comment 4: It is recommended to re-examine and design the Figures.

Response:

We have re-examined all figures and replaced them with high-quality ones. In particular, we have redesigned Figure 2 to ensure all subfigures are clear and readable.

Comment 5: A deep and detailed comparison with other methods is mandatory.

Response:

In the revised version, we have conducted a deep and detailed comparison with other methods (STOMP and Series2Graph) by adding more experimental results (see Figure 13 and S1 Table) on more datasets (see page 12, lines 412-419 and Table 4) and added the whole Section 5.5 for detailed explanation.

Comment 6: The authors should also clarify the motivation and main contribution of applied approach more clearly in the introduction and conclusion sections.

Response:

As suggested, we have revised the introduction and conclusion sections to clarify our motivation and contribution (see introduction section lines 86-103 on page 3 and conclusion section on page 17).

Comment 7: The results and discussion section has to be improved, where more details of the achieved results should be stated clearly in this section. In addition, authors also have to provide some insight discussion of the results.

Response:

We have improved the results and discussion section as suggested. We have added more detailed results and insights in the revised paper (see Section 5.5, Figure 13 and S1 Table).

Response to Reviewer 2

Comment 1. Image quality is very poor. For example, it was hard to determine what's going on in (c1) and (c2) in Figure 2. The edge labels in (C3) of the same figure is almost illegible. Please replace the current images with high quality ones. Figure 3 is so blurry that the points and links between points are hard to distinguish. Better quality images would be help to understand the method well.

Response:

We have re-examined Figure 3 and all of the other figures and replaced them with high-quality ones. In particular, we have redesigned Figure 2 to ensure all subfigures are clear and readable.

Comment 2. It's confusing that Z has been represented as a vector in Algorithm 1 (but it's actually a 2D matrix)? The symbols and notations used in the algorithm should be defined clearly.

Response:

We have revised Algorithm 1 and provided a clear definition for matrix Z (see Algorithm 1 and lines 280-281 on page 8). We have also included a list of symbols in Table 1 on Page 7.

Comment 3. The authors claim, in section 5.3, that the proposed method outperforms both STOMP and Series2Graph methods. This doesn't really give an accurate picture of the comparison. Out of 5 datasets, the proposed method outperforms Series2Graph on only one and performs the same on the remaining datasets. The authors should revise their claim and make it more specific. The one dataset on which the proposed method performs better, it does 20% better than Series2Graph which is satisfactory. However, if we consider the overall performance comparison shown in Figure 10, the proposed method does not do significantly better than Series2Graph. It would be more appealing if the authors used more datasets and could show that their proposed method performs better on at least about 50% of the datasets used in the experiment. However, considering the scalability factor, the proposed method does have some useful contributions.

Response:

To enhance the evaluation, we have added more datasets (100 time series for seven groups of UCR datasets listed in Table 4) to compare our method with the Series2Graph method. The experimental results now show that the proposed GraphTS outperforms Series2Graph in total accuracy while requiring less runtime; that is, GraphTS outperforms Series2Graph in seven groups and achieves the same accuracy in two groups (see Section 5.5, Figure 13 and S1 Table). We have also revised our claim in Section 5.3 (see lines 489-492 on page 15).

Comment 4. Minor issues: Some grammar and spelling errors exist. For example, it should be "rest", not "reset", in the paragraph under Table 1.

Response:

We have corrected the errors as suggested. Moreover, we checked the whole paper for grammar, spelling and punctuation mistakes and made the corrections in the revised version.

Attachment

Submitted filename: Response to Reviewers.docx


PLoS One. doi: 10.1371/journal.pone.0290092.r003

Decision Letter 1

Vijayalakshmi Kakulapati

7 Jun 2023

PONE-D-22-15666R1

GraphTS: Graph-Represented Time Series for Subsequence Anomaly Detection

PLOS ONE

Dear Dr. Huang,

Please submit your revised manuscript by Jul 21 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Vijayalakshmi Kakulapati, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

Reviewer #3: (No Response)

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

Reviewer #3: (No Response)

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: N/A

Reviewer #3: (No Response)

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #2: Yes

Reviewer #3: (No Response)

Reviewer #4: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

Reviewer #3: (No Response)

Reviewer #4: (No Response)

**********

6. Review Comments to the Author

Reviewer #2: This article proposes a graph-based anomaly detection method for time series data. The proposed method, GraphTS, aims to do efficient detection of both recurrent and rare anomalies.

I'll repeat from my first review that this article is well-organized, well-explained, and easy to read.

I had several major concerns that the authors have addressed in their revised manuscript. One of which was about the lack of enough data and some claims they made that I didn't find convincing.

However, I'm satisfied with their explanation and revision.

Reviewer #3: Author should add keyword list. keywords list contain 5 to 8 words.

Conclusion to be made more systematic and future scope to be elaborated more on technical features that are planned to be added in the proposed system in the near future.

author add more reference in introduction as below

1-Anomaly Detection via Correlation Clustering

2-Attack-Specific Feature Selection for Anomaly Detection in Software-Defined Networks

3-Two-phase classification: ANN and A-SVM classifiers on motor imagery BCI

Authors should further explain equations and maths. It is too hard to understand at the moment. Secondly, if these are general maths easily available on the internet, then authors should remove it and add reference instead.

The use of English language is fine, however, it is recommended to be checked once again.

Reviewer #4: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

**********

PLoS One. 2023 Aug 16;18(8):e0290092. doi: 10.1371/journal.pone.0290092.r004

Author response to Decision Letter 1

18 Jul 2023

Response to reviewers’ comments for: GraphTS: Graph-Represented Time Series for Subsequence Anomaly Detection

Response to Reviewer 2

Comment 1: This article proposes a graph-based anomaly detection method for time series data. The proposed method, GraphTS, aims to do efficient detection of both recurrent and rare anomalies. I'll repeat from my first review that this article is well-organized, well-explained, and easy to read. I had several major concerns that the authors have addressed in their revised manuscript. One of which was about the lack of enough data and some claims they made that I didn't find convincing. However, I'm satisfied with their explanation and revision.

Response:

Thank you for your positive feedback on our revised manuscript. We appreciate your contribution to improving our work.

Response to Reviewer 3

Comment 1. Author should add keyword list. keywords list contain 5 to 8 words.

Response:

As suggested, we have added a keywords list in the revised paper (see keywords section, page 1).

Comment 2. Conclusion to be made more systematic and future scope to be elaborated more on technical features that are planned to be added in the proposed system in the near future.

Response:

We have revised the conclusion section (see conclusion section, page 17) and elaborated on the future work in the revised paper (see lines 559-572 on page 17).

Comment 3. author add more reference in introduction as below

1-Anomaly Detection via Correlation Clustering

2-Attack-Specific Feature Selection for Anomaly Detection in Software-Defined Networks

3-Two-phase classification: ANN and A-SVM classifiers on motor imagery BCI

Response:

As suggested, we have added more related works (i.e., [45] [46] [47]) to the introduction section.

Comment 4. Authors should further explain equations and maths. It is too hard to understand at the moment. Secondly, if these are general maths easily available on the internet, then authors should remove it and add reference instead.

Response:

We have added explanation for Eq. (1) and cited reference for Definitions 1-4.

Comment 5. The use of English language is fine, however, it is recommended to be checked once again.

Response:

We have checked the whole paper for grammar, spelling and punctuation mistakes and made the corrections in the revised version.

Attachment

Submitted filename: Response to Reviewers comments.docx


PLoS One. doi: 10.1371/journal.pone.0290092.r005

Decision Letter 2

Vijayalakshmi Kakulapati

2 Aug 2023

GraphTS: Graph-Represented Time Series for Subsequence Anomaly Detection

PONE-D-22-15666R2

Dear Dr. Huang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vijayalakshmi Kakulapati, Ph.D

Academic Editor

PLOS ONE

Reviewers' comments:

**********

PLoS One. doi: 10.1371/journal.pone.0290092.r006

Acceptance letter

Vijayalakshmi Kakulapati

7 Aug 2023

PONE-D-22-15666R2

GraphTS: Graph-Represented Time Series for Subsequence Anomaly Detection

Dear Dr. Huang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vijayalakshmi Kakulapati

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Table. Performance comparisons with the Series2Graph methods on UCR datasets.

(PDF)


Data Availability Statement

[pone.0290092.ref001] 1. Zarei R, He J, Huang G, Zhang Y. Effective and efficient detection of premature ventricular contractions based on variation of principal directions. Digital Signal Processing. 2016;50:93–102. doi: 10.1016/j.dsp.2015.12.002 [DOI] [Google Scholar]

[pone.0290092.ref002] 2. Feng Y, Cai W, Yue H, Xu J, Lin Y, Chen J, et al. An improved X-means and isolation forest based methodology for network traffic anomaly detection. Plos one. 2022;17(1):e0263423. doi: 10.1371/journal.pone.0263423 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0290092.ref003] 3.Chiu B, Keogh E, Lonardi S. Probabilistic discovery of time series motifs. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining; 2003. p. 493–498.

[pone.0290092.ref004] 4.Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, et al. Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE; 2016. p. 1317–1322.

[pone.0290092.ref005] 5. Linardi M, Zhu Y, Palpanas T, Keogh E. Matrix profile goes MAD: variable-length motif and discord discovery in data series. Data Mining and Knowledge Discovery. 2020;34:1022–1071. doi: 10.1007/s10618-020-00685-w [DOI] [Google Scholar]

[pone.0290092.ref006] 6. Yoshihara K, Takahashi K. A simple method for unsupervised anomaly detection: An application to Web time series data. PloS one. 2022;17(1):e0262463. doi: 10.1371/journal.pone.0262463 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0290092.ref007] 7. Guo A, Smith S, Khan YM, Langabeer JR II, Foraker RE. Application of a time-series deep learning model to predict cardiac dysrhythmias in electronic health records. PloS one. 2021;16(9):e0239007. doi: 10.1371/journal.pone.0239007 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0290092.ref008] 8.Shaw P, Barr JR, Abu-Khzam FN. Anomaly detection via correlation clustering. In: 2022 IEEE 16th International Conference on Semantic Computing (ICSC). IEEE; 2022. p. 307–313.

[pone.0290092.ref009] 9.Abbas N, Nasser Y, Shehab M, Sharafeddine S. Attack-specific feature selection for anomaly detection in software-defined networks. In: 2021 3rd IEEE middle east and north Africa communications conference (menacomm). IEEE; 2021. p. 142–146.

[pone.0290092.ref010] 10. Gupta M, Gao J, Aggarwal C, Han J. Outlier detection for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery. 2014;5(1):1–129. doi: 10.1007/978-3-031-01905-0_1 [DOI] [Google Scholar]

[pone.0290092.ref011] 11.Boniol P, Linardi M, Roncallo F, Palpanas T. SAD: an unsupervised system for subsequence anomaly detection. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE; 2020. p. 1778–1781.

[pone.0290092.ref012] 12. Kondylakis H, Dayan N, Zoumpatianos K, Palpanas T. Coconut: sortable summarizations for scalable indexes over static and streaming data series. The VLDB Journal. 2019;28(6):847–869. doi: 10.1007/s00778-019-00573-w [DOI] [Google Scholar]

[pone.0290092.ref013] 13.Hadjem M, Naït-Abdesselam F, Khokhar A. ST-segment and T-wave anomalies prediction in an ECG data using RUSBoost. In: 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom). IEEE; 2016. p. 1–6.

[pone.0290092.ref014] 14. Zarei R, He J, Siuly S, Huang G, Zhang Y. Exploring Douglas-Peucker algorithm in the detection of epileptic seizure from multicategory EEG signals. Hindawi; 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0290092.ref015] 15. Judith AM, Priya SB, Mahendran RK, Gadekallu TR, Ambati LS. Two-phase classification: ANN and A-SVM classifiers on motor imagery BCI. ASIAN JOURNAL OF CONTROL. 2022. [Google Scholar]

[pone.0290092.ref016] 16. Senin P, Lin J, Wang X, Oates T, Gandhi S, Boedihardjo AP, et al. Time series anomaly discovery with grammar-based compression. In: Edbt; 2015. p. 481–492. [Google Scholar]

[pone.0290092.ref017] 17. Rasheed F, Alhajj R. A framework for periodic outlier pattern detection in time-series sequences. IEEE transactions on cybernetics. 2013;44(5):569–582. doi: 10.1109/TSMCC.2013.2261984 [DOI] [PubMed] [Google Scholar]

[pone.0290092.ref018] 18.Yang J, Wang W, Yu PS. Infominer: mining surprising periodic patterns. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining; 2001. p. 395–400.

[pone.0290092.ref019] 19.Wei L, Keogh E, Xi X. Saxually explicit images: Finding unusual shapes. In: Sixth International Conference on Data Mining (ICDM’06). IEEE; 2006. p. 711–720.

20. Yankov D, Keogh E, Rebbapragada U. Disk aware discord discovery: Finding unusual time series in terabyte sized datasets. Knowledge and Information Systems. 2008;17(2):241–262. doi: 10.1007/s10115-008-0131-9

21. Zhu Y, Zimmerman Z, Senobari NS, Yeh CCM, Funning G, Mueen A, et al. Matrix Profile II: Exploiting a novel algorithm and GPUs to break the one hundred million barrier for time series motifs and joins. In: 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE; 2016. p. 739–748.

22. Boniol P, Palpanas T. Series2graph: Graph-based subsequence anomaly detection for time series. Proceedings of the VLDB Endowment. 2020;13(12):1821–1834. doi: 10.14778/3415478.3415514

23. Senin P, Lin J, Wang X, Oates T, Gandhi S, Boedihardjo AP, et al. Grammarviz 3.0: Interactive discovery of variable-length time series patterns. ACM Transactions on Knowledge Discovery from Data (TKDD). 2018;12(1):1–28. doi: 10.1145/3051126

24. Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee SH, Handley J. Compression-based data mining of sequential data. Data Mining and Knowledge Discovery. 2007;14(1):99–129. doi: 10.1007/s10618-006-0049-3

25. Liu Y, Chen X, Wang F, Yin J. Efficient detection of discords for time series stream. In: Advances in Data and Web Management. Springer; 2009. p. 629–634.

26. Fu AWC, Leung OTW, Keogh E, Lin J. Finding time series discords based on Haar transform. In: International Conference on Advanced Data Mining and Applications. Springer; 2006. p. 31–41.

27. Bu Y, Leung TW, Fu AWC, Keogh E, Pei J, Meshkin S. Wat: Finding top-k discords in time series database. In: Proceedings of the 2007 SIAM International Conference on Data Mining. SIAM; 2007. p. 449–454.

28. Luo W, Gallagher M. Faster and parameter-free discord search in quasi-periodic time series. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer; 2011. p. 135–148.

29. Keogh E, Lin J, Fu A. Hot sax: Efficiently finding the most unusual time series subsequence. In: Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE; 2005. 8 pp.

30. Lkhagva B, Suzuki Y, Kawagoe K. New time series data representation ESAX for financial applications. In: 22nd International Conference on Data Engineering Workshops (ICDEW’06). IEEE; 2006. p. x115.

31. Sun Y, Li J, Liu J, Sun B, Chow C. An improvement of symbolic aggregate approximation distance measure for time series. Neurocomputing. 2014;138:189–198. doi: 10.1016/j.neucom.2014.01.045

32. Nakamura T, Imamura M, Mercer R, Keogh E. MERLIN: Parameter-Free Discovery of Arbitrary Length Anomalies in Massive Time Series Archives. In: 2020 IEEE International Conference on Data Mining (ICDM). IEEE; 2020. p. 1190–1195.

33. Yang Z, Fan D, Wang Q, Luan G. Sharp decrease in the Laplacian matrix rank of phase-space graphs: a potential biomarker in epilepsy. Cognitive Neurodynamics. 2021; p. 1–11.

34. Jiang Y, Bao X, Hao S, Zhao H, Li X, Wu X. Monthly Streamflow Forecasting Using ELM-IPSO Based on Phase Space Reconstruction. Water Resources Management. 2020;34(11):3515–3531. doi: 10.1007/s11269-020-02631-3

35. Marwan N, Donges JF, Zou Y, Donner RV, Kurths J. Complex network approach for recurrence analysis of time series. Physics Letters A. 2009;373(46):4246–4254. doi: 10.1016/j.physleta.2009.09.042

36. Scarsoglio S, Cazzato F, Ridolfi L. From time-series to complex networks: Application to the cerebrovascular flow patterns in atrial fibrillation. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2017;27(9):093107. doi: 10.1063/1.5003791

37. Supriya S, Siuly S, Wang H, Zhang Y. New feature extraction for automated detection of epileptic seizure using complex network framework. Applied Acoustics. 2021;180:108098. doi: 10.1016/j.apacoust.2021.108098

38. Li G, Jung JJ. Dynamic graph embedding for outlier detection on multiple meteorological time series. PLoS ONE. 2021;16(2):e0247119. doi: 10.1371/journal.pone.0247119

39. Farag A, Abdelkader H, Salem R. Parallel graph-based anomaly detection technique for sequential data. Journal of King Saud University-Computer and Information Sciences. 2019.

40. Ali M, Jones MW, Xie X, Williams M. TimeCluster: dimension reduction applied to temporal data for visual analytics. The Visual Computer. 2019;35(6):1013–1026. doi: 10.1007/s00371-019-01673-y

41. Abdul-Aziz A, Woike MR, Oza NC, Matthews BL, Lekki JD. Rotor health monitoring combining spin tests and data-driven anomaly detection methods. Structural Health Monitoring. 2012;11(1):3–12. doi: 10.1177/1475921710395811

42. Abdul-Aziz A, Woike M, Oza N, Matthews B, Baakilini G. Propulsion health monitoring of a turbine engine disk using spin test data. In: Health Monitoring of Structural and Biological Systems 2010. vol. 7650. International Society for Optics and Photonics; 2010. p. 76501B.

43. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet. Circulation. 2000;101(23):e215–e220. doi: 10.1161/01.CIR.101.23.e215

44. Moody GB, Mark RG. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine. 2001;20(3):45–50. doi: 10.1109/51.932724

45. Wu R, Keogh E. Current time series anomaly detection benchmarks are flawed and are creating the illusion of progress. IEEE Transactions on Knowledge and Data Engineering. 2021. doi: 10.1109/TKDE.2021.3112126

46. Imani S, Abdoli A, Beyram A, Imani A, Keogh E. Multi-Window-Finder: Domain Agnostic Window Size for Time Series Data; 2021.
