PLOS One. 2022 Dec 15;17(12):e0278146. doi: 10.1371/journal.pone.0278146

clusTransition: An R package for monitoring transition in cluster solutions of temporal datasets

Muhammad Atif 1,2,*, Friedrich Leisch 2
Editor: Mohammad Mehdi Rashidi 3
PMCID: PMC9754593  PMID: 36520935

Abstract

The primary purpose of clustering analysis is to divide a dataset into a finite number of segments based on the similarities between items. In recent years, a significant amount of research has focused on the spatio-temporal aspects of clustering. In this setting, clusters are no longer regarded as static objects, since changes in the underlying population influence them. This paper describes an R package implementing the MONIC framework for tracing the evolution of clusters extracted from temporal datasets. The package is named clusTransition, which stands for Cluster Transition. The algorithm is based on re-clustering cumulative datasets that evolve at successive time-points and monitoring the transitions experienced by the clusters in these clustering solutions. The contribution of this paper is to describe how the package clusTransition is developed in the R programming language and to discuss its workflow using hypothetical and real-life datasets.

1 Introduction

The prime goal of clustering analysis is to organize a dataset into a finite number of segments according to the similarities between objects. Ideally, the objects in the same segment should be more similar to one another than to the objects belonging to other segments [1]. Each individual partition is known as a cluster, whereas the objects belonging to the same cluster are called its members [2, 3]. Clustering has many real-world applications, ranging from business and economics, marketing, pattern recognition, medical sciences, and image processing to big data analysis [4]. For example, in market segmentation, better marketing strategies can be adopted by clustering customers with similar demographic or buying characteristics [5]. In a similar fashion, clustering can help in better understanding a disease and targeting appropriate treatment by sub-grouping patients into homogeneous sets based on psychological inventory scores [6]. Since the notion of clustering is not precisely defined, several algorithms and models have been proposed in the literature, and they may produce quite different clustering solutions [7, 8].

In recent years, a considerable amount of research has investigated the spatio-temporal properties of clustering. In these applications, clusters are no longer considered static objects, as they are affected by changes occurring in the underlying population [9, 15]. The inclusion of new data records in the original population over time may affect cluster memberships, and entirely different clustering solutions may be generated at later time-points. This transition in clustering solutions may include the disappearance of specific clusters, the migration of some elements from one cluster to another, the splitting of a cluster into several, the merging of several clusters into one, the survival of a cluster, and the emergence of new ones. Survived clusters can experience internal transitions, including changes in location, size, and density [10, 11]. Various topics such as spatio-temporal, evolutionary, stream, and incremental clustering address this issue by working with datasets that change over time. Tracing and understanding the phenomena behind this transition is of practical importance for effective decision-making, for example in marketing, fraud detection, networking, scientific publication, and health [12].

In many real-world applications, clustering of a data stream is performed continuously to identify changes occurring in the pattern of the underlying phenomena [13]. In a stream, new data items are continually generated and join the underlying population at regular intervals. Therefore, in order to control which part of the data contributes to the pattern being mined, the stream needs to be discretized into subsets based on some ordered attribute. This discretization into subsets is called the windowing approach and is mainly done based on time. Some of the most commonly used examples are the landmark, sliding, and damped window models [14]. The landmark and sliding window models are discussed in the next section.

[15] introduces the notion of evolutionary clustering to process time-stamped datasets by producing a sequence of clustering solutions, that is, one clustering solution for each time-step of the temporal data. The algorithm optimizes two competing criteria: each clustering in the sequence should be similar to the clustering at the previous time-step, while at the same time it should accurately reflect the data arriving during that time-step. This framework has been further extended to spectral clustering [16], density-based clustering [17], and the Hierarchical Dirichlet Process with the Hidden Markov model [18].

Using a fully online method, Hyde et al. [19] offer an algorithm that clusters evolving data streams into arbitrarily shaped clusters. The approach consists of two stages: the first stage finds micro-clusters in the datasets, and the second stage merges these micro-clusters into macro-clusters. In a similar vein, Fahy et al. [20] describe an Ant Colony Stream Clustering technique built on a density-based methodology that recognises clusters as collections of micro-clusters. To read a stream and create micro-clusters in the window pane, the method uses a tumbling window model. These clusters are then further refined by combining related clusters based on a similarity index. Fahy and Yang [21] further enhance this technique to address the multi-density issue in the density-based clustering strategy. Their method uses the local radius of each cluster to identify clusters, and it then tracks changes in the solutions. Multiple view clustering challenges are addressed for the first time by Huang et al. [22] in the MVStream clustering method. In order to assign cluster labels to the data items that include summary statistics, this technique creates support vectors from various views of the data objects. Similarly, some studies have been conducted on measuring the similarities between trajectories in a dynamic environment [23–25].

2 Window models

In a landmark window model, all items that arrive after some specific time-point (landmark time) are maintained and cannot be discarded irrespective of window size. The window size is unbounded and keeps increasing as time progresses [26, 27]. The data records that arrive in the interval (ti−1, ti) are accumulated according to the equation given by:

$$D_t = \bigcup_{i=1}^{t} d_i, \qquad t = 1, 2, \ldots, n \tag{1}$$

where n is the number of time-points and t is the current time-point. The landmark window model generates n window panes, where each pane contains the data items arriving from the starting time-point t1 up to the current time-point.
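The accumulation in Eq 1 can be illustrated with a minimal base R sketch. The helper name `landmark_panes` and the toy stream are assumptions made for illustration; they are not part of clusTransition, and the union in Eq 1 is approximated here by row-binding the batches.

```r
# Landmark window: pane D_t is the row-wise accumulation of d_1, ..., d_t (Eq 1).
# `landmark_panes` is a hypothetical helper, not a clusTransition function.
landmark_panes <- function(listdata) {
  Reduce(rbind, listdata, accumulate = TRUE)
}

# Toy stream: three batches of two-dimensional points (10, 5 and 3 rows).
set.seed(1)
stream <- list(
  matrix(rnorm(20), ncol = 2),
  matrix(rnorm(10), ncol = 2),
  matrix(rnorm(6),  ncol = 2)
)

panes <- landmark_panes(stream)
pane_sizes <- vapply(panes, nrow, integer(1))
print(pane_sizes)  # sizes grow as 10, 15, 18
```

Each pane contains every row of the previous one, which is why the window size keeps growing with the number of time-points.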

The sliding window model, on the other hand, is based on a window of fixed size w that contains only those objects falling in the interval [ti−w+1, ti], while older cases are discarded. In this type of model, as time progresses, the window slides forward while keeping its size w by including new data records and discarding the older ones [27, 28]. The sliding window model can be described by the equations below:

$$D_1 = d_1 \tag{2}$$
$$D_2 = \bigcup_{i=1}^{w} d_i \tag{3}$$
$$D_3 = \bigcup_{i=2}^{w+1} d_i \tag{4}$$
$$D_m = \bigcup_{i=n-w+1}^{n} d_i \tag{5}$$

where m is the number of window panes and is equal to n − w + 2, n is the number of time-points, and w is the sliding window size.
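As a sketch of Eqs 2 to 5, the base R function below builds the n − w + 2 panes of a sliding window. The name `sliding_panes` is hypothetical, the union is approximated by row-binding, and the batch sizes are toy values chosen only to make the pane sizes easy to check.

```r
# Sliding window of size w (Eqs 2-5): D_1 = d_1, followed by the n - w + 1
# windows that each cover w consecutive batches.
# `sliding_panes` is a hypothetical helper, not a clusTransition function.
sliding_panes <- function(listdata, w) {
  n <- length(listdata)
  stopifnot(w >= 2, w <= n)
  starts <- seq_len(n - w + 1)  # index of the first batch in each full window
  full <- lapply(starts, function(s) do.call(rbind, listdata[s:(s + w - 1)]))
  c(list(listdata[[1]]), full)
}

# Toy stream with batch sizes 20, 12, 6 and 3.
set.seed(1)
stream <- lapply(c(20, 12, 6, 3), function(m) matrix(rnorm(2 * m), ncol = 2))

panes <- sliding_panes(stream, w = 3)
pane_sizes <- vapply(panes, nrow, integer(1))
print(pane_sizes)  # 20, 38, 21
```

With n = 4 and w = 3 the sketch returns m = 3 panes, in line with m = n − w + 2.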

3 The change detection algorithm

In order to monitor and trace the evolution of clusters extracted by re-clustering cumulative datasets, [29] introduced a framework known as the ‘MONIC’ algorithm. This algorithm is based on clustering cumulative datasets arriving at discrete time-points t1, t2, …, tn. Initially, the data is collected at time-point t1, and as time progresses new data records join the dataset at regular intervals. The datasets d1, d2, …, dn are accumulated and re-clustered at each time-point t1, t2, …, tn to monitor and detect cluster evolution over time.

The algorithm is mainly based on the idea of a non-symmetric overlap matrix between two clusterings extracted from cumulative datasets at two different time-points. Let ξi = {X1, X2, …, Xk1} be the set of clusters extracted from dataset Di at time point ti, referred to as the first clustering. Similarly, let ξj = {Y1, Y2, …, Yk2} be the set of clusters extracted from dataset Dj at time point tj (i < j), referred to as the second clustering. Then the overlap matrix can be defined as:

$$\mathrm{overlap}(X_i, Y_j) = \frac{|X_i \cap Y_j|}{|X_i|}, \qquad i = 1, 2, \ldots, k_1, \quad j = 1, 2, \ldots, k_2 \tag{6}$$

where k1 is the number of clusters in the first clustering ξi, and k2 is the number of clusters in the second clustering ξj. This generates a matrix of order k1 × k2, whose rows and columns correspond to the first and second clustering respectively. Each element of the matrix is the similarity index between the corresponding clusters Xi and Yj. The MONIC framework assumes hard clustering, where each observation belongs to one and only one cluster [30].
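A minimal base R sketch of Eq 6 is given below. It assumes that every object carries an identifier so that the same object can be recognised in both clusterings; the function name `overlap_matrix` and the toy labels are illustrative assumptions and do not reproduce the package's Overlap() method.

```r
# Overlap matrix of Eq 6: entry (a, b) is |X_a intersect Y_b| / |X_a|.
# ids_* identify objects, labels_* give their hard cluster labels.
# `overlap_matrix` is a hypothetical helper, not a clusTransition function.
overlap_matrix <- function(ids_x, labels_x, ids_y, labels_y) {
  pos <- match(ids_x, ids_y)  # NA if an object was discarded by the window
  fx <- factor(labels_x)
  fy <- factor(labels_y[pos], levels = sort(unique(labels_y)))
  counts <- unclass(table(fx, fy))  # shared objects per pair of clusters
  counts / as.vector(table(fx))     # divide row a by the size of X_a
}

# First clustering: 6 objects in 2 clusters.
# Second clustering: the same 6 objects plus 2 new ones, in 3 clusters.
ov <- overlap_matrix(
  ids_x = 1:6, labels_x = c(1, 1, 1, 2, 2, 2),
  ids_y = 1:8, labels_y = c(1, 1, 2, 2, 2, 2, 3, 3)
)
print(round(ov, 2))
```

The matrix is not symmetric because each row is normalised by the size of the cluster from the first clustering only.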

In the context of this algorithm, a transition is the change experienced by a cluster Xi ϵ ξi when it is observed again in the second clustering ξj. This change in the clustering solution is either an external or an internal transition. External transitions concern the relationship of a cluster found in clustering ξi to the clusters found in clustering ξj, whereas internal transitions are changes in the structure of the survived clusters.

External transitions fall into five categories: survive, merge, split, disappear, and emerge. A cluster Xl ϵ ξi may survive into Ym ϵ ξj, clusters {Xl1, Xl2} ⊆ ξi may merge to form Ym ϵ ξj, or a cluster Xl ϵ ξi may split into several daughter clusters {Ym1, Ym2} ⊆ ξj. If a cluster Xl ϵ ξi does not experience any of these transitions, it disappears. Similarly, if a cluster Ym ϵ ξj is not the result of any external transition from its ancestors, it is a newly emerged candidate. The overlap between Xl ϵ ξi and Ym ϵ ξj serves as the indicator for identifying the external transition experienced by the clusters of clustering ξi. This value is compared with a minimum threshold value, say τ ϵ [0.5, 1], to identify the match of X ϵ ξi in Y ϵ ξj. A cluster Xl ϵ ξi is said to survive in Ym ϵ ξj if it is the only cluster whose overlap with Ym is greater than τsurvive. If at least two clusters of ξi (such as Xl1 and Xl2) have an overlap greater than τsurvive with Ym ϵ ξj, then it is a case of merge, i.e. Xl1 and Xl2 merge to form Ym. Furthermore, a cluster Xl is said to split into daughter clusters Ym1 and Ym2 if its overlap with each of them is greater than τsplit and their collective overlap is greater than τsurvive, i.e. a split requires the following two conditions:

$$\mathrm{Overlap}(X_l, Y_m) > \tau_{split}, \qquad m = 1, 2, \ldots, M \tag{7}$$
$$\sum_{m=1}^{M} \mathrm{Overlap}(X_l, Y_m) > \tau_{survive} \tag{8}$$

where M is the number of daughter clusters from second clustering.
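The rules for external transitions can be sketched in base R as follows. This is a simplified reading of the rules stated above, not the package's implementation: the function name `external_transitions`, the order in which the rules are checked, and the overlap values in the example are all assumptions made for illustration.

```r
# Classify external transitions from an overlap matrix `ov`
# (rows: first clustering, columns: second clustering).
# `external_transitions` is a hypothetical helper, not a clusTransition function.
external_transitions <- function(ov, tau_survive = 0.8, tau_split = 0.3) {
  k1 <- nrow(ov)
  k2 <- ncol(ov)
  status_x <- rep("disappear", k1)  # default fate of a first-clustering cluster
  used_y <- rep(FALSE, k2)          # TRUE once Y_m has at least one ancestor
  matched <- ov > tau_survive

  # Survive or merge: how many ancestors match each Y_m above tau_survive?
  for (m in seq_len(k2)) {
    src <- which(matched[, m])
    if (length(src) == 1) status_x[src] <- "survive"
    if (length(src) >= 2) status_x[src] <- "merge"
    if (length(src) >= 1) used_y[m] <- TRUE
  }

  # Split: conditions of Eq 7 (each part) and Eq 8 (parts taken together).
  for (l in which(status_x == "disappear")) {
    parts <- which(ov[l, ] > tau_split)
    if (length(parts) >= 2 && sum(ov[l, parts]) > tau_survive) {
      status_x[l] <- "split"
      used_y[parts] <- TRUE
    }
  }

  list(first = status_x, emerged = which(!used_y))
}

# Illustrative overlap values: three clusters survive and X2 splits in two.
ov <- rbind(
  c(0.95, 0.00, 0.00, 0.05, 0.00),
  c(0.00, 0.02, 0.50, 0.00, 0.48),
  c(0.00, 0.00, 0.00, 0.90, 0.10),
  c(0.00, 0.97, 0.03, 0.00, 0.00)
)
res <- external_transitions(ov)
print(res$first)

# Two clusters absorbed into one, leaving the second column without ancestors.
res_merge <- external_transitions(rbind(c(0.90, 0.10), c(0.85, 0.15)))
print(res_merge)
```

Because τsurvive is at least 0.5 and the clustering is hard, a cluster of the first clustering can exceed the survival threshold with at most one cluster of the second clustering.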

The overlap cannot be used as an indicator for monitoring changes in the form of survived clusters. The shift in the location of a survived cluster (Xl → Ym) can be traced by calculating the Euclidean distance between the two centroids, normalized by the minimum radius. This information can be summarized in the following formula:

$$\text{location.difference} = \frac{d(\bar{X}_l, \bar{Y}_m)}{\min(r_X, r_Y)} \tag{9}$$

where X̄l and Ȳm are the centroids of clusters Xl and Ym respectively, and d(X̄l, Ȳm) is the Euclidean distance between them. The radius r of a cluster is computed as the maximum distance of an object from its cluster centroid. If the absolute value of location.difference is greater than τlocation, then the algorithm detects a shift in the location of the survived cluster.
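Eq 9 can be sketched in a few lines of base R. The helper names `centroid_radius` and `location_difference` are hypothetical, and the two toy clusters are chosen so that the result can be checked by hand.

```r
# Centroid and radius (maximum distance of an object from the centroid).
# Both helpers are hypothetical and not part of clusTransition.
centroid_radius <- function(X) {
  X <- as.matrix(X)
  centre <- colMeans(X)
  dists <- sqrt(rowSums(sweep(X, 2, centre)^2))
  list(centre = centre, radius = max(dists))
}

# Eq 9: distance between centroids, normalised by the smaller radius.
location_difference <- function(X, Y) {
  sx <- centroid_radius(X)
  sy <- centroid_radius(Y)
  sqrt(sum((sx$centre - sy$centre)^2)) / min(sx$radius, sy$radius)
}

# A unit-spaced square and the same square shifted by one unit along x.
X <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))
Y <- sweep(X, 2, c(1, 0), "+")

loc_diff <- location_difference(X, Y)
print(loc_diff)        # 1 / sqrt(2), about 0.707
print(loc_diff > 0.3)  # compared with a location threshold of 0.3
```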

For the density transition, the average distance of objects from the cluster centroid is computed. The formula for the density of a cluster is given by:

$$\text{avgDistance} = \frac{1}{|X_l|} \sum_{i=1}^{n_l} \lVert X_{li} - \bar{X}_l \rVert \tag{10}$$

The difference in density of cluster Xl survived in Ym is normalized by the minimum radius i.e.

$$\text{density.difference} = \frac{\text{avgDistance}_X - \text{avgDistance}_Y}{\min(r_X, r_Y)} \tag{11}$$

If the absolute value of density.difference is less than τdensity, then there is no change in the density of the survived cluster. On the other hand, if the absolute value is greater than τdensity, then a change in density is detected. If density.difference is positive, the cluster is more compact than its ancestor; otherwise, it has become more diffuse.
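The density comparison of Eqs 10 and 11 can be sketched in base R as below. The helper names are hypothetical, and the distance in Eq 10 is taken to be Euclidean, in line with Eq 9; the toy clusters are chosen so that the result is easy to verify.

```r
# Distances of all objects from their cluster centroid.
# All helpers here are hypothetical and not part of clusTransition.
centroid_distances <- function(X) {
  X <- as.matrix(X)
  sqrt(rowSums(sweep(X, 2, colMeans(X))^2))
}

# Eq 11: difference of average distances (Eq 10), normalised by the
# smaller of the two radii.
density_difference <- function(X, Y) {
  dx <- centroid_distances(X)
  dy <- centroid_distances(Y)
  (mean(dx) - mean(dy)) / min(max(dx), max(dy))
}

# Ancestor X: four points at distance 2 from the origin.
# Descendant Y: the same shape shrunk to distance 1.
X <- rbind(c(2, 0), c(-2, 0), c(0, 2), c(0, -2))
Y <- X / 2

dens_diff <- density_difference(X, Y)
print(dens_diff)  # (2 - 1) / min(2, 1) = 1, positive: Y is more compact
```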

4 Package description

The state-of-the-art “MONIC” algorithm is implemented in R via the package clusTransition. The package can be used for tracing and monitoring the evolution of clustering solutions in cumulative datasets over time. In this section, we describe the functions and methods exported by the package. Fig 1 below demonstrates the workflow of the package.

Fig 1. Workflow diagram of the package.


The Transition function exported by the package offers three different options for importing datasets. The function then traces changes in the clustering solutions.

Table 1 below summarizes the functions, methods, and classes exported by the package along with their corresponding arguments.

Table 1. Functions, methods and classes exported by the package clusTransition.

| Name | Type | Description |
| --- | --- | --- |
| Monitor cluster evolution: Transition(listdata, …) | Function | Implements the change detection algorithm and traces the evolution of clusters over time. Returns an object of class MONIC. |
| OverLap class: new("OverLap") | S4 class | Class containing cluster representatives and overlap matrices. |
| Overlap: Overlap(object, e1, e2) | S4 method | Method for initializing the slots of the OverLap class. |
| Plot cluster evolution: moplot(object) | Function | Plots an object of class MONIC. |

More details about these functions and classes are described below.

4.1 Function Transition()

The evolution of clusters can be traced using the primary function Transition(), which returns an S4 object. In implementing the package clusTransition, we have considered the portability of the functions to various types of hard clustering algorithms. A typical call to the Transition() function involves three essential pieces: the data input (listdata, listclus, or Overlap), the choice of window via swSize, and the threshold parameters. The swSize and k arguments need to be provided only when datasets are imported using the listdata argument. The function has the following interface:

>Transition(listdata, listclus = NULL, Overlap = NULL, swSize = 1, typeind = 1,

 + survival_thresHold = 0.8, split_thresHold = 0.3, location_thresHold = 0.3,

 + density_thresHold = 0.3, k)

Three different options, i.e. listdata, listclus, and Overlap, are provided for importing the data.

The listdata argument imports the raw data stream at discrete time points t1, t2, …, tn. A sequence of cluster solutions is generated from the stream using the k-means clustering algorithm. Each element of the list corresponds to the dataset at a single time point. The number of clusters in each cumulative data matrix is specified by the argument k.

On the other hand, the listclus argument imports the clustering solutions at successive time-points, which allows clustering algorithms other than k-means. Each element of listclus is a nested list that contains the clustering solution at the corresponding time point, i.e. ξi = {X1, X2, …, Xki}.

Overlap is a list of numeric matrices containing similarity measures between clusters extracted at consecutive time points. The similarity between clusters is computed using Eq 6. The Overlap method exported by the package can be used to compute the similarity matrices.

swSize indicates the size of the sliding window. The default value swSize = 1 implements the landmark window model and discretizes the stream according to Eq 1, whereas other numeric values discretize the stream using the sliding window scenario of Eqs 2–5. The sliding window size can be provided only if the listdata argument is chosen.

The survival_thresHold, split_thresHold, location_thresHold, and density_thresHold arguments are the minimum threshold values for the survival of clusters from Xϵξi to Yϵξj, the split of a cluster Xϵξi into {Ym1, Ym2} ⊆ ξj, the shift in location, and the change in density of survived clusters respectively. These are user-defined parameters and belong to the interval (0,1).

One of the most perplexing problems with most clustering algorithms is deciding the ideal number of partitions. This is a crucial parameter for partitioning, hierarchical, and model-based clustering algorithms, since the number of clusters to generate from a dataset has to be predefined. There are several ways of estimating the optimal number of clusters k, such as the silhouette, Gap, and Elbow methods. k is a numeric vector containing the relevant number of clusters at each time-point. The length of k is determined by swSize, since it must equal the number of window panes. This argument should be provided only if the listdata argument is chosen.
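As an illustration of the Elbow method mentioned above, the base R sketch below computes the total within-cluster sum of squares for a range of candidate values of k on one simulated window pane. The simulated pane, the grid 1:6, and the use of stats::kmeans() are assumptions made for this example; the paper's own worked examples rely on the Gap statistic instead.

```r
# Elbow heuristic on a single window pane (illustrative only).
set.seed(42)

# Simulated pane with three well-separated groups of 50 points each.
pane <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

k_grid <- 1:6
wss <- vapply(
  k_grid,
  function(k) kmeans(pane, centers = k, nstart = 10)$tot.withinss,
  numeric(1)
)

# The "elbow" is the k after which the curve flattens out.
plot(k_grid, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Repeating this for every window pane gives one value per pane, which together form the vector passed as k.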

Printing the object returned by Transition(), by typing its name, displays the external and internal transition results at each time point. External transitions comprise the number of clusters that survived, were absorbed by others, split into several clusters, disappeared, or newly emerged at the second clustering. Internal transitions comprise changes in the location and density of the survived clusters.

Along with this information, the Monic object holds the cluster’s radius, membership, and distance between cluster centres.

4.2 OverLap class

OverLap is an S4 class whose objects contain summaries of the first and second clustering. An object has eight slots that serve as input for tracking the evolution of clusters by the Transition() function. The slots include a numeric matrix containing the similarities between clusters generated at the first and second clustering (the overlap computed from Eq 6), the clusters' membership vectors, radii, centres, and the average distance of items from the cluster centres (computed from Eq 10). An object of this class is created as follows:

>obj <- new("OverLap")

4.3 Overlap method

This method initializes the slots of an object of class OverLap by importing the clustering solutions ξ of cumulative datasets D at two consecutive time points i and j. The clusters at each time point should be provided as a list of matrices, where each matrix contains the data belonging to one cluster. It has the following interface:

>Overlap <- Overlap(object, e1 = C1, e2 = C2)

where e1 is the set of clusters ξi = {X1, X2, …, Xk1} obtained at time point ti from cumulative dataset Di, e2 is the set of clusters ξj = {Y1, Y2, …, Yk2} obtained at time point tj from cumulative dataset Dj, and object is an object of class OverLap.

4.4 Function moplot()

This function draws three bar-plots and one line graph. The first, a stacked bar-plot, shows the SurvivalRatio and AbsorptionRatio; the second bar-plot shows the number of newly emerged clusters at each time stamp; and the third bar-plot shows the number of disappeared clusters at each time stamp. The line graph shows the passforward Ratio and SurvivalRatio.

> plot(obj)

5 Simulation example

Let us assume that a data stream consists of datasets d1, d2, …, dn arriving at corresponding time-points t1, t2, …, tn respectively. To generate the initial dataset d1, we use a generator that takes into account the number of clusters (k), the size of each cluster, and the separation value between them [31]. The generator for the subsequent datasets d2, d3, …, dn takes the center of each cluster, the size of each cluster, and the covariance structure between them as input [32, 33].

As a working example, we generate a data stream arriving at four consecutive time points. Fig 2 below demonstrates the scenario for generating datasets di, i = 1, 2, 3, 4 at four time points. The new objects joining the underlying population are shown in red, whereas older records are shown in black.

Fig 2. Data stream generated at four discrete time stamps.


The new data items at each time stamp are shown in red, whereas the older data items are shown in black.

6 Pre-processing

Before the change detection algorithm is applied to cluster solutions over time, the user needs to pre-specify some relevant parameters. First, the user needs to decide on a suitable windowing approach for the accumulation of datasets evolving at successive time points. For this purpose, the package offers two types of windowing approaches, i.e. the landmark and sliding window models. The windowing approach accumulates the datasets at corresponding time points according to the chosen model and generates window panes at successive time points. In the second phase, the optimal number of clusters in each window pane Di at the corresponding time point must be determined using an appropriate technique. For illustration purposes, we use worked examples based on the datasets simulated in section 5. The datasets are accumulated according to the landmark and sliding windowing approaches, and then the optimal number of clusters is estimated in each window pane Di.

The implementation of the landmark window model produces four window panes. Each pane contains the datasets generated in the interval [t1, ti], where ti represents the current time point. Table 2 below shows the number of objects and the optimal number of clusters in each window pane Di, estimated from the Gap statistic at the corresponding time points ti.

Table 2. Optimal number of clusters in landmark window model datasets.

| Time points | t1 | t2 | t3 | t4 |
| --- | --- | --- | --- | --- |
| Window panes | D1 | D2 | D3 | D4 |
| Number of objects (ni) | 20,000 | 32,000 | 38,000 | 41,000 |
| Number of clusters (ki) | 4 | 4 | 5 | 4 |

The table discretizes the data stream according to the landmark window model explained in Eq 1. The landmark window model provides n window panes of cumulative datasets.

Similarly, the implementation of a sliding window of size 3 will generate 3 window panes. Table 3 below demonstrates the number of objects and optimal number of clusters in each window pane Di.

Table 3. Optimal number of clusters in sliding window model datasets.

| Time points | t1 | t2 | t3 |
| --- | --- | --- | --- |
| Window panes | D1 | D2 | D3 |
| Number of objects (ni) | 20,000 | 38,000 | 21,000 |
| Number of clusters (ki) | 4 | 5 | 6 |

The table discretizes the data stream according to the sliding window model explained in Eq 5. The sliding window model provides n − w + 2 window panes of cumulative datasets, where w is the size of the sliding window.

7 Implementation of function Transition()

In this section, the implementation of the primary function Transition() is presented using working examples. The data stream simulated in section 5 is used for monitoring the cluster evolution over time. The function provides three different options for importing the datasets, which are explained in the subsections below.

7.1 Looking at listdata argument

The argument listdata is a list of matrices or data frames containing the datasets d1, d2, …, dn evolving at corresponding time-points t1, t2, …, tn. The ith element of listdata comprises the set of data items di that arrive at time point ti. The Transition() function accumulates the datasets di according to the windowing approach specified by the swSize argument. The default value, swSize = 1, implements the landmark window model, whereas other integer values implement the sliding window model. The accumulation of datasets di generates window panes Di that contain cumulative datasets at successive time points. Each window pane Di is re-clustered using the cclust() function from the flexclust package [34]. The optimal number of clusters in the cumulative datasets Di should be decided by the user and must be supplied via the argument k of the function. Both the k and swSize arguments are used only if the listdata option is chosen for importing the datasets di. Setting typeind = 1 selects the listdata argument. Monitoring and tracking the evolution of clusters using the landmark window model is shown in the example below.

7.1.1 Example (listdata argument with landmark window model)

The default value swSize = 1 implements the landmark window model and generates n window panes of cumulative datasets Di according to Eq 1. In this working example, the datasets generated in section 5 are used. According to Table 2, the window panes D1, D2, D3, and D4 in this simulated example comprise 4, 4, 5, and 4 clusters respectively. Hence the Transition() function with arguments listdata = listdata, swSize = 1, typeind = 1, Survival_thrHold = 0.8, Split_thrHold = 0.3, and k = c(4,4,5,4) can be called as:

>library(clusTransition)

>listdata <- list(d1, d2, d3, d4)

>clusterTrace <- Transition(listdata = listdata, swSize = 1, typeind = 1,

+ Survival_thrHold = 0.8, Split_thrHold = 0.3, k = c(4,4,5,4))

This generates two tables displaying the number of clusters experiencing external and internal transitions at successive time points. The first table in the output comprises the number of clusters that experience external transitions at the corresponding time points tj. Similarly, the second table comprises the number of survived clusters that have undergone internal transitions at the corresponding time points. Hence the full summary of external and internal transitions is shown below.

The object clusterTrace returned by the Transition() function is an S4 object of class Monic. The object contains the candidates that experience external and internal transitions at successive time points. The slots ending with x represent candidates that undergo external transitions from the first clustering ξi, whereas the slots ending with y represent the candidates that evolve as a result of the corresponding external transition at the second clustering ξj. For example, the candidates that experience external transitions at time point t3 can be retrieved as:

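[Console output, available only as an image in the original (pone.0278146.e027.jpg): candidates experiencing external transitions at time point t3.]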

Let Cim ϵ ξi (first clustering) be a cluster that experiences an external transition and evolves as Cjn ϵ ξj (second clustering), where the first subscript (i or j) denotes the time point and the second subscript (m or n) denotes the cluster number. The Time Step [[3]] in the output refers to the time point tj of the second clustering, and hence the time point ti of the first clustering ξi is one less (i = j − 1). In this particular example i = 2 and j = 3, and the above transition can be summarized as follows:

The algorithm detects that three clusters survive (C21 → C31, C23 → C34, and C24 → C32) and one cluster splits (C22 → {C33, C35}).

7.1.2 Example (listdata argument with sliding window model)

If one is interested in the sliding window model, where older records are discarded as time progresses, this can be achieved with the swSize argument. In this synthetic example, swSize = 3 generates window panes that contain the datasets arriving in the interval [ti−3+1, ti]. Table 3 shows that the numbers of clusters in window panes D1, D2, and D3 are 4, 5, and 6 respectively. Hence the Transition() function with arguments listdata = listdata, swSize = 3, typeind = 1, Survival_thrHold = 0.8, Split_thrHold = 0.3, and k = c(4,5,6) can be called as:

>clusterTrace <- Transition(listdata = listdata, swSize = 3, typeind = 1,

+ Survival_thrHold = 0.8, Split_thrHold = 0.3, k = c(4,5,6))

7.2 Looking at listclus argument

The listdata argument permits the user to supply un-clustered datasets d1, d2, …, dn arriving at time-points t1, t2, …, tn. However, this restricts the package to only one type of clustering algorithm, i.e. the k-means algorithm. In order to make the package more flexible for other types of hard clustering, an alternative argument, listclus, is provided in the function. The listclus argument imports the clustering solutions of each window pane as a list, i.e. listclus = {ξ1, ξ2, …, ξn}, and the function computes the similarity indices between them. The argument listclus is a list in which every individual element is a nested list of matrices or data frames. The ith element corresponds to the set of clusters ξi = {X1, X2, …, Xki} extracted at time-point ti by applying an appropriate clustering algorithm to window pane Di. This is explained in the example given below.

7.2.1 Example: Listclus argument

Prior to applying the Transition() function, the user needs to extract clusters from each window pane Di. For this purpose, first accumulate the initially collected datasets d1, d2, …, dn according to a suitable window model, here the landmark model. This can be done by explicitly calling the merge() function from the base package. Running the R code given below generates 4 panes.

>D1 <- d1

>D2 <- merge(d1, d2, all.x = TRUE, all.y = TRUE)

>D3 <- merge(D2, d3, all.x = TRUE, all.y = TRUE)

>D4 <- merge(D3, d4, all.x = TRUE, all.y = TRUE)

Fitting of clustering algorithm

Afterward, choose the relevant number of clusters from each window pane Di, and extract clusters by implementing an appropriate clustering algorithm. Save this clustering solution as a list of matrices or data frames. For illustration purposes, we obtain 4, 4, 5, and 4 clusters from datasets D1, D2, D3, and D4 respectively.

>set.seed(100)

>fit1 <- kmeans(D1, 4)

>C1 <- list()

>for(i in 1:4)C1[[i]] <- D1[fit1$cluster == i,]

where C1 = {C11, C12, C13, C14} is a list of clusters extracted from D1 at time point t1. Similarly, extract clusters from all window panes at corresponding time point as:

>fit2 <- kmeans(D2, 4)

>C2 <- list()

>for(i in 1:4)C2[[i]] <- D2[fit2$cluster == i,]

>fit3 <- kmeans(D3, 5)

>C3 <- list()

>for(i in 1:5)C3[[i]] <- D3[fit3$cluster == i,]

>fit4 <- kmeans(D4, 4)

>C4 <- list()

>for(i in 1:4)C4[[i]] <- D4[fit4$cluster == i,]

Combine all these lists of clustering solutions in a single list and apply Transition() function with arguments listclus = listclus, typeind = 3, Survival_thrHold = 0.8, Split_thrHold = 0.3 as:

>listclus <- list(C1, C2, C3, C4)

>clusterTrace <- Transition(listclus = listclus, typeind = 3,

+ Survival_thrHold = 0.8, Split_thrHold = 0.3)
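Because listclus only expects one list of per-cluster matrices or data frames for each window pane, the same structure can be built from any hard clustering algorithm. The sketch below uses average-linkage hierarchical clustering from base R on a simulated pane; the simulated data and the choice of hclust() are assumptions for illustration and are not taken from the paper.

```r
# Build one element of a listclus-style list with hierarchical clustering
# instead of k-means (illustrative sketch on simulated data).
set.seed(7)
D1 <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),  # 20 points around (0, 0)
  matrix(rnorm(40, mean = 5), ncol = 2)   # 20 points around (5, 5)
)

# Hard cluster labels from average-linkage hierarchical clustering.
labels <- cutree(hclust(dist(D1), method = "average"), k = 2)

# One data frame per cluster, in the same shape as C1 in the example above.
C1 <- unname(split(as.data.frame(D1), labels))
cluster_sizes <- vapply(C1, nrow, integer(1))
print(cluster_sizes)
```

Repeating this step for every window pane and collecting the results with list() yields an object with the structure described for listclus.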

7.3 Looking at Overlap argument

The Overlap argument also permits the user to apply other types of clustering algorithms and trace the evolution of clusters over time. The Overlap argument imports a list of objects, as produced by the Overlap() method, that contain the similarity between clusterings obtained at successive time points ti and tj (i < j) together with the summaries of these clusters. This option is selected by setting typeind = 2. The overlap matrices can be computed with the S4 method Overlap() exported by the clusTransition package. In the same way as for listclus, any clustering algorithm can be applied to the landmark or sliding window modeled dataset to extract the cluster memberships at the corresponding time-points. The lists of clusters extracted from Di−1 and Di can then be used to compute the overlap matrix between the clusterings. This is elaborated in the working example given below.

7.3.1 Example: Overlap argument

Let C1 = {C11, C12, C13, C14}, C2 = {C21, C22, C23, C24}, C3 = {C31, C32, C33, C34, C35}, and C4 = {C41, C42, C43, C44} be the set of clustering solutions obtained from corresponding datasets D1, D2, D3, and D4. These sets of clustering solutions are already obtained in the previous example. Then the objects of class OverLap can be created and initialized as:

>obj <- new("OverLap")

>Overlap1 <- Overlap(obj, e1 = C1, e2 = C2)

>Overlap2 <- Overlap(obj, e1 = C2, e2 = C3)

>Overlap3 <- Overlap(obj, e1 = C3, e2 = C4)

Combine all these objects in a list and apply Transition() function with arguments Overlap = Overlap, typeind = 2, Survival_thrHold = 0.8, Split_thrHold = 0.3 as:

>Overlap <- list(Overlap1, Overlap2, Overlap3)

>clusterTrace <- Transition(Overlap = Overlap, typeind = 2,

 + Survival_thrHold = 0.8, Split_thrHold = 0.3)

7.4 moplot() function

Fig 3 displays the graphical summary of an object of class Monic generated as output by the Transition() function. The stacked bar plot in the top left corner displays the survival and absorption ratios at successive time points. The figure illustrates that all clusters survived at time point t1, and hence the survival ratio is 1. At time point t2, 3 out of 4 clusters survived, resulting in a survival ratio of 0.75. Similarly, at time point t3, 3 out of 5 clusters survived while 2 merged, resulting in survival and absorption ratios of 0.60 and 0.40, respectively. No cluster disappeared and no newly emerged cluster was detected at any of the time points. This can be seen from the pass-forward ratio, which is unity at all time points except t2, where one cluster splits into daughter clusters.

Fig 3. Data stream generated at four discrete time stamps.

Fig 3

New data items at each time stamp are shown in red, whereas older data items are shown in black.
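The survival and absorption ratios quoted in the discussion of the moplot() output can be reproduced by hand. The short sketch below assumes that each ratio is simply the number of clusters undergoing the transition divided by the number of clusters present at that time point; the variable names are illustrative and do not come from the package.

```r
# Counts read off the description of the moplot() summary; names are illustrative.
n_clusters <- c(t1 = 4, t2 = 4, t3 = 5)   # clusters present before each transition
n_survived <- c(t1 = 4, t2 = 3, t3 = 3)   # clusters that survived
n_absorbed <- c(t1 = 0, t2 = 0, t3 = 2)   # clusters that merged into another

# Assumed definition: share of clusters that experienced the transition
survival_ratio   <- n_survived / n_clusters
absorption_ratio <- n_absorbed / n_clusters

print(round(rbind(survival_ratio, absorption_ratio), 2))
```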

8 Real data example

To demonstrate the practicality of the package and to better understand the applications of cluster evolution, we investigate three real-life datasets. To study the transformation in social, political, and moral attitudes of European nations, the Human Values datasets were extracted from the European Social Survey [35]. The changes in electricity consumption of inhabitants were traced using the Individual Household Electric Power Consumption dataset. Similarly, the Intel Lab sensor streaming dataset was used to show the applications of the framework. Both these data streams were extracted from the home page of the “UCI Machine Learning Repository”.

8.1 Application to human values scale

As a case study, we extract eight datasets, each corresponding to a single round of the European Social Survey (ESS) conducted in the years 2002, 2004, 2006, 2008, 2010, 2012, 2014, and 2016, respectively. The dataset consists of 25024 individuals who responded to the Schwartz Value Survey (SVS) for computing basic human values and can be downloaded from the URL https://ess-search.nsd.no/CDW/ConceptVariables. The ten basic values are Benevolence, Universalism, Self-direction, Security, Conformity, Hedonism, Achievement, Tradition, Stimulation, and Power [35]. The k-means clustering algorithm was applied to the sliding window modeled datasets at each time point, and the number of clusters in the respective datasets was estimated with the well-known GAP statistic. Fig 4 below describes the evolution of clusters at time points ti, i = 1, 2, 3, 4, 5, 6, 7 in the Human Value scale datasets and demonstrates that two clusters, C11 and C12, survived over time. The first important cluster was C11(C11C22C32C42), which emerged at t1(2002) and survived until t4(2006, 2010). Although the cluster survived until 2010, it experienced internal transitions, became more diffuse, and eventually disappeared at time point t5. The second vibrant cluster was C12(C12C24C33C41C52C63C71), which survived through the entire time span. This was the most important cluster because it not only survived over time but also became denser. Most of the new respondents of the SVS surveys over the years joined this cluster. A shift in location was observed for this cluster at time points t2 and t3, after which it remained stable. The first external transition was experienced by the cluster C14, which split into two clusters and ultimately disappeared. The algorithm also detects a cluster C61 that emerged at t6(2010, 2014) and was passed forward while absorbing elements of the cluster C62.

Fig 4. Transition of clusters in basic human values datasets.

Fig 4
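The per-window step described above (estimate the number of clusters with the gap statistic, then run k-means) could be sketched as follows for a single window pane. This is a minimal sketch on synthetic data, not the code used for the ESS analysis; it assumes the recommended cluster package is installed, and the data, the range of k, and the number of bootstrap samples B are illustrative choices.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(1)
# Synthetic stand-in for one window pane: three well-separated groups in 2-D
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

# Gap statistic for k = 1..6, with k-means as the clustering function
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 20, nstart = 10)
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

# Re-fit k-means with the selected k and split the pane into a list of clusters,
# mirroring how the C1, ..., C4 lists are built in the synthetic example
fit <- kmeans(x, centers = k_hat, nstart = 10)
clusters <- lapply(seq_len(k_hat), function(i) x[fit$cluster == i, , drop = FALSE])
```

Repeating this for every window pane yields one list of clusters per time point, which can then be collected into the list passed to Transition().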

8.2 Application to Individual Household Electric Power Consumption

As a second example, the Individual Household Electric Power Consumption dataset for the years [2006, 2010] was used. This dataset comprises 2075259 households characterized by seven numerical attributes. The dataset is available at the UCI machine learning repository [36] and can be downloaded from https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. A sliding window model of size 2 was used to accumulate the stream at successive time points. In this section, we use the CLARA algorithm to extract clusters from the datasets at successive time points, and the average silhouette method was used to estimate the optimal value of k in each window pane. Fig 5 below demonstrates the evolution of clusters at time points ti, i = 1, 2, 3, 4, 5 in the individual household electric power consumption datasets. The algorithm detects that all four clusters survived (C11C21, C12C21, C13C23, and C14C24), experienced internal transition, and became diffuse during [2006, 2007]. A shift in location was detected for only one cluster, C13, whereas the other clusters were stable in location. Similarly, three clusters survived (C21C31, C22C33, and C24C34), one cluster disappeared (C23→ ⊙), and one cluster emerged (⊙→C32) during [2007, 2008]. Two of the surviving clusters became diffuse, while one cluster became more compact than its predecessor. Likewise, one cluster survived (C33C43), three disappeared (C31→ ⊙, C32→ ⊙, and C34→ ⊙), and three newly emerged clusters (⊙→C41, ⊙→C42, and ⊙→C44) were detected during [2008, 2009]. Afterwards, all four clusters disappeared (C41→ ⊙, C42→ ⊙, C43→ ⊙, and C44→ ⊙), and three new clusters emerged (⊙→C51, ⊙→C52, and ⊙→C53) during [2009, 2010].

Fig 5. Transition of clusters in Individual Household Electric Power Consumption datasets.

Fig 5
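A sliding window of size 2 combined with CLARA and the average silhouette width could be set up roughly as shown below. The sketch uses a synthetic stream rather than the household data; the column names, the candidate range for k, and the treatment of the first pane (which only holds one period) are assumptions made for illustration, and the recommended cluster package is assumed to be installed.

```r
library(cluster)  # provides clara()

set.seed(2)
# Synthetic stand-in for a time-stamped stream: 5 periods, 200 records each
stream <- data.frame(period = rep(1:5, each = 200),
                     x1 = rnorm(1000), x2 = rnorm(1000))
stream$x1 <- stream$x1 + 4 * (seq_len(1000) %% 2)  # two groups along x1

window_size <- 2
periods <- sort(unique(stream$period))

# Sliding window: pane i keeps the current period and the (window_size - 1) before it
panes <- lapply(seq_along(periods), function(i) {
  keep <- periods[max(1, i - window_size + 1):i]
  stream[stream$period %in% keep, c("x1", "x2")]
})

# For each pane, pick k in 2..5 with the largest average silhouette width
best_k <- sapply(panes, function(d) {
  ks <- 2:5
  sil <- sapply(ks, function(k) clara(d, k, samples = 10)$silinfo$avg.width)
  ks[which.max(sil)]
})
```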

8.3 Intel Lab dataset

In this section, we used the publicly accessible dataset recorded from 54 sensors deployed at the Intel research laboratory between February 28th and April 5th, 2004. Each sensor records information on temperature, humidity, voltage, and light every thirty-one seconds. The dataset comprises 2.3 million readings collected from the 54 sensors. The sensors were designed to be energy-efficient and to consume power only when sensing the environment and transmitting data. We select only a subset of measurements from this dataset and include readings from sensor-1 only. This subset of the data consists of 43,047 readings and can be downloaded from the URL https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.

We accumulate the dataset according to the landmark window model and, as the flow is uniform, consider 9000 records per time period. This implementation generates 5 window panes of cumulative datasets. The shadow statistic was used to decide the optimal number of clusters in the cumulative dataset at each time point. The Partitioning Around Medoids (PAM) algorithm was used for extracting clusters from the datasets.
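The pane sizes implied by this landmark window can be checked with a few lines of base R. The sketch assumes that records are processed in arrival order and that the final period simply holds the remaining records; the variable names are illustrative.

```r
n_records <- 43047   # readings from sensor-1, as stated in the text
chunk     <- 9000    # records per time period

# Period label for every record, in arrival order
period  <- ceiling(seq_len(n_records) / chunk)
n_panes <- max(period)

# Landmark window: pane i holds everything that arrived up to period i
pane_sizes <- sapply(seq_len(n_panes), function(i) sum(period <= i))
print(pane_sizes)
```

Because the landmark model never discards records, every pane is a superset of the previous one, in contrast to the sliding window used for the household data.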

Fig 6 below demonstrates the transitions of clusters at time points ti, i = 1, 2, 3, 4, 5 in the Intel Lab dataset. The algorithm detects that all six clusters survived (C11C21, C12C22, C13C24, C14C25, C15C26, and C16C23) while one new cluster emerged (⊙→C27) at time point t2. All surviving clusters experienced internal transition and became more diffuse. Six clusters also survived (C21C31, C22C32, C24C33, C25C34, C26C35, and C27C36) and one cluster disappeared (C23→ ⊙) at time point t3. Cluster C24 experienced a double internal transition, i.e., a shift in location and a change in density, while the other clusters only became diffuse. Likewise, five clusters survived (C31C43, C32C45, C34C44, C35C42, and C36C47), one cluster disappeared (C33→ ⊙), and two clusters emerged (⊙→C41 and ⊙→C46) at time point t4. Similarly, five clusters survived (C42C54, C43C56, C44C57, C45C53, and C47C55), two clusters merged ({C41, C46}→C51), and one cluster emerged (⊙→C52) at time point t5.

Fig 6. Evolution of clusters in Intel Lab dataset.

Fig 6

For further details on the significance and practical applications of monitoring changes in clustering solutions of streaming datasets, see Atif et al. [37].

9 Concluding remarks

In this paper, we introduce the R package clusTransition, dedicated to tracing the evolution of cluster solutions in cumulative datasets. The package implements the state-of-the-art MONIC algorithm for modeling and tracing the transitions of cluster solutions in dynamic datasets. This algorithm is based on re-clustering the cumulative datasets D1, D2, …, Dn arriving at the corresponding time points t1, t2, …, tn and monitoring the changes occurring in these cluster solutions. The changes comprise clusters that survive, split into several clusters, are absorbed by others, disappear, or newly emerge. Clusters that survive an external transition may experience a change in location and density, called an internal transition. We have applied the clusTransition package to synthetic as well as real-life datasets to gain insight into the change detection framework.

10 Limitations of the package

The clusTransition package relies on batch processing: the stream is discretized and the gathered data are placed into a windowing model. The datasets are not clustered in real time immediately upon arrival. Similarly, the sliding and landmark window models either include a data item in full or ignore it entirely at subsequent time points. A damped window model, on the other hand, assigns each object an exponentially decreasing weight depending on its arrival time. Future plans call for adding support for the damped window model to the R package for change detection.
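To make the idea of a damped window concrete, the sketch below assigns exponentially decreasing weights by age, using the form 2^(−λ · age) that is common in the stream clustering literature. It is not part of clusTransition and does not describe the planned implementation; the decay rate lambda and the toy arrival times are assumptions.

```r
# Hypothetical decay rate; larger values forget old records faster
lambda  <- 0.5
arrival <- c(1, 1, 2, 3, 4, 4)   # period in which each record arrived
now     <- 4                     # current time point

# Weight is 1 for records arriving now and halves every 1 / lambda periods
w <- 2^(-lambda * (now - arrival))
print(round(w, 3))
```

In contrast, a sliding window corresponds to weights that are 1 inside the window and 0 outside it, and a landmark window to weights that are always 1.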

The paradigm for cluster transition monitoring presupposes hard clustering, which requires that each item be assigned to one and only one cluster. This assumption implies that the strategy cannot be applied to density-based or model-based clustering approaches, leaving the problem open for further investigation.

Data Availability

All relevant data are within the paper. All other data streams used in the manuscript are available in public repositories. DOI for Human Value Scale datasets: doi:10.21338/NSD-ESS-CUMULATIVE. Link for Human Value Scale datasets: https://ess-search.nsd.no/CDW/ConceptVariables. Link for Household Electric Power Consumption dataset: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. Link for Intel Lab sensor datasets: https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Wierzchoń and M. Kłopotek. Modern Algorithms of Cluster Analysis. Studies in Big Data. Springer International Publishing, 2017. URL: https://books.google.com.pk/books?id=LeJEDwAAQBAJ.
  • 2. Rapkin B.D., Luke D.A. Cluster analysis in community research: Epistemology and practice. Am J Commun Psychol. 1993; 21, 247–277. doi: 10.1007/BF00941623
  • 3. H.C. Romesburg. Cluster Analysis for Researchers. Morrisville, NC: Lulu.com. (Reprint of 1984 edition, with minor revisions.); 2004.
  • 4. Fahad A., Alshatri N., Tari Z., Alamri A., Khalil I., Zomaya A. Y., et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing. 2014; 2(3):267–279. doi: 10.1109/TETC.2014.2330519
  • 5. Montinaro M. and Sciascia I. Market segmentation models to obtain different kinds of customer loyalty. Journal of Applied Sciences. 2011; 11(4):655–662. doi: 10.3923/jas.2011.655.662
  • 6. Borgen F. H. and Barnett D. C. Applying cluster analysis in counseling psychology research. Journal of Counseling Psychology. 1987; 34(4):456–468. doi: 10.1037/0022-0167.34.4.456
  • 7. Punj G. and Stewart D. W. Cluster analysis in marketing research: Review and suggestions for application. Journal of Marketing Research. 1983; 20(2):134. doi: 10.2307/3151680
  • 8. Zakharov K. Application of k-means clustering in psychological studies. The Quantitative Methods for Psychology. 2016; 12(2):87–100. doi: 10.20982/tqmp.12.2.p087
  • 9. Landauer M., Wurzenberger M., Skopik F., Settanni G., and Filzmoser P. Dynamic log file analysis: An unsupervised cluster evolution approach for anomaly detection. Computers & Security. 2018; 79:94–116. doi: 10.1016/j.cose.2018.08.009
  • 10. M. Oliveira and J. Gama. Mec – monitoring clusters’ transitions. In Proceedings of the 2010 Conference on STAIRS 2010: Proceedings of the Fifth Starting AI Researchers’ Symposium, page 212–224, NLD, 2010. IOS Press. ISBN 9781607506751.
  • 11. Spiliopoulou M., Ntoutsi E., Theodoridis Y., and Schult R. Monic and followups on modeling and monitoring cluster transitions. Advanced Information Systems Engineering Lecture Notes in Computer Science. 2013; 622–626. doi: 10.1007/978-3-642-40994-3_41
  • 12. Silva J. D. A., Hruschka E. R., and Gama J. An evolutionary algorithm for clustering data streams with a variable number of clusters. Expert Systems with Applications. 2017; 67:228–238. doi: 10.1016/j.eswa.2016.09.020
  • 13. S. Badiozamany, K. Orsborn, and T. Risch. Framework for real-time clustering over sliding windows. Proceedings of the 28th International Conference on Scientific and Statistical Database Management (SSDBM 16). 2016. doi: 10.1145/2949689.2949696
  • 14. Patroumpas K. and Sellis T. Window specification over data streams. Current Trends in Database Technology – EDBT 2006, Lecture Notes in Computer Science. 2006; 445–464. doi: 10.1007/11896548_35
  • 15. Deepayan Chakrabarti, Ravi Kumar, and Andrew Tomkins. Evolutionary clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’06). 2006. Association for Computing Machinery, New York, NY, USA, 554–560. doi: 10.1145/1150402.1150467
  • 16. Y. Chi, X. Song, D. Zhou, K. Hino, B. L. Tseng. On evolutionary spectral clustering. ACM Trans. Knowl. Discov. 2009. doi: 10.1145/1631162.1631165
  • 17. Y. Zhang, H. Liu, B. Deng. Evolutionary clustering with dbscan. Ninth International Conference on Natural Computation (ICNC). 2013; 923–928. doi: 10.1109/ICNC.2013.6818108
  • 18. T. Xu, Z. Zhang, P. S. Yu, B. Long. Evolutionary clustering by hierarchical dirichlet process with hidden markov state. Eighth IEEE International Conference on Data Mining. 2008; 658–667. doi: 10.1109/ICDM.2008.24
  • 19. Hyde R., Angelov P., and MacKenzie A.R. Fully online clustering of evolving data streams into arbitrarily shaped clusters. Information Sciences. 2017; 382: 96–114.
  • 20. Fahy C., Yang S., and Gongora M. Ant colony stream clustering: A fast density clustering algorithm for dynamic data streams. IEEE Trans. Cybern. 2019; 49: 2215–2228. doi: 10.1109/TCYB.2018.2822552
  • 21. C. Fahy and S. Yang. Finding and tracking multi-density clusters in online dynamic data streams. IEEE Trans. Big Data. 2019; 1–15. doi: 10.1109/TBDATA.2019.2922969
  • 22. Huang L., Wang C.-D., Chao H.-Y., and Yu P.S. MVStream: Multiview data stream clustering. IEEE Trans. Neural Netw. Learn. Syst. 2020; 31: 3482–3496. doi: 10.1109/TNNLS.2019.2944851
  • 23. Li H., Liu J., Yang Z., Liu R. W., Wu K., and Wan Y. Adaptively constrained dynamic time warping for time series classification and clustering. Information Sciences. 2020; 534: 97–116. doi: 10.1016/j.ins.2020.04.009
  • 24. Liang M., Liu R. W., Li S., Xiao Z., Liu X., and Lu F. An unsupervised learning method with convolutional auto-encoder for vessel trajectory similarity computation. Ocean Eng. 2021; 225: 108803. doi: 10.1016/j.oceaneng.2021.108803
  • 25. Z. Zhang, K. Huang, and T. Tan. Comparison of Similarity Measures for Trajectory Clustering in Outdoor Surveillance Scenes. Proceedings of the 18th International Conference on Pattern Recognition (ICPR06), IEEE, Hong Kong, China, 2006: 1135–1138.
  • 26. Liu X., Guan J., and Hu P. Mining frequent closed itemsets from a landmark window over online data streams. Computers & Mathematics with Applications. 2009; 57(6):927–936. doi: 10.1016/j.camwa.2008.10.060
  • 27. Mansalis S., Ntoutsi E., Pelekis N., and Theodoridis Y. An evaluation of data stream clustering algorithms. Statistical Analysis and Data Mining: The ASA Data Science Journal. 2018; 11(4):167–187. doi: 10.1002/sam.11380
  • 28. Hu Y. Optimal algorithm of data streams clustering on sliding window model. Journal of Computer Applications. 2008; 28(6):1414–1416. doi: 10.3724/SP.J.1087.2008.01414
  • 29. M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult. Monic. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD 06). 2006. doi: 10.1145/1150402.1150491
  • 30. Ntoutsi E., Spiliopoulou M., and Theodoridis Y. Fingerprint. International Journal of Data Warehousing and Mining. 2012; 8(3):27–44. doi: 10.4018/jdwm.2012070102
  • 31. Qiu W. and Joe H. Generation of random clusters with specified degree of separation. Journal of Classification. 2006; 23(2):315–334. doi: 10.1007/s00357-006-0018-y
  • 32. Weiliang Qiu and Harry Joe. clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). R package version 1.3.7. https://CRAN.R-project.org/package=clusterGeneration
  • 33. Melnykov Volodymyr, Chen Wei-Chen, Maitra Ranjan. MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms. Journal of Statistical Software. 2012; 51(12): 1–25. doi: 10.18637/jss.v051.i12
  • 34. Leisch F. A toolbox for k-centroids cluster analysis. Computational Statistics and Data Analysis. 2006; 51(2):526–544. doi: 10.1016/j.csda.2005.10.006
  • 35. European Social Survey Cumulative File, ESS 1-9 (2020). Data file edition 1.0. Sikt - Norwegian Agency for Shared Services in Education and Research, Norway. Data Archive and distributor of ESS data for ESS ERIC. doi: 10.21338/NSD-ESS-CUMULATIVE
  • 36. Dua, D. and Graff, C. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • 37. Atif Muhammad, Shafiq Muhammad and Leisch Friedrich. Applications of monitoring and tracing the evolution of clustering solutions in dynamic datasets. Journal of Applied Statistics. 2021; doi: 10.1080/02664763.2021.2008882

Decision Letter 0

Mohammad Mehdi Rashidi

12 Aug 2022

PONE-D-22-20071

clusTransition: An R Package for Monitoring Transition in Cluster Solutions of Temporal Datasets

PLOS ONE

Dear Dr. Atif,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 26 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Mohammad Mehdi Rashidi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper proposes a R package that implementing MONIC framework for clustering temporal datasets. It has some merits to be published. However, it still has some problems.

1. The workflow of the proposed clusTransition should be given;

2. What are the major limitations of this R package?

3. More comparative experiments are required.

4. More important articles should be cited.

5. More standard data sets from UCI should be checked.

Reviewer #2: In this paper, an R package called clusTransition is discussed, and the use of the package is demonstrated. I found the package interesting. However, the presentation needs improvements. I listed my comments below:

1. There is no need to mention t1, t2, ..., tn in the abstract. Please consider removing it.

2. In the Abstract, it is mentioned that "The contribution of this paper is to demonstrate the implementation of the package using synthetic and real-life datasets in R software." This sentence gives the impression that this paper is demonstrating someone else's package. However, the package's author is the first author of the manuscript. Please revise this sentence.

3. Please do not give mathematical definitions in the Introduction. Instead, you should discuss the necessity of the package and such a paper to describe it, mention the contributions and finish with an outline paragraph. You can move the details of the method that the package implements to the next section.

4. Please link the function descriptions of the package to the mathematical definition of the methods implemented by the package.

5. It looks like the help documentation of the package is repeated in the manuscript. Since all those points are already given in the documentation, repeating them is unnecessary. Instead, please discuss how to specify the inputs and how to use the outputs in relation to the methods.

6. Manuscript needs to be checked against English language issues.

7. It is mentioned that "we generate a data stream sprouting at four consecutive time points". Please elaborate on the generation of data.

8. It is mentioned that "... we generate a data stream sprouting at four consecutive time points." But in the next sections, there are only two applications: "Application to Human values scale" and "Application to Individual Household Electric Power Consumption."

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

PLoS One. 2022 Dec 15;17(12):e0278146. doi: 10.1371/journal.pone.0278146.r002

Author response to Decision Letter 0


10 Sep 2022

Thank you for giving me the opportunity to submit a revised draft of my manuscript titled clusTransition: An R Package for Monitoring Transition in Cluster Solutions of Temporal Datasets to PLOS ONE. We appreciate the time and effort that you and the reviewers have dedicated to providing your valuable feedback on my manuscript. We are grateful to the reviewers for their insightful comments on my paper. We have highlighted the changes within the manuscript.

Here is a point-by-point response to the reviewers’ comments and concerns.

Comments from Reviewer # 1:

Comment # 1: The workflow of the proposed clusTransition should be given;

Response: Added to the manuscript with track changes.

Comment # 2: What are the major limitations of this R package?

Response: Added to the manuscript with track changes.

Comment # 3: More comparative experiments are required.

Response: Most of the articles that discuss the analysis of streaming data use real-life datasets. Unfortunately, there is no guidance available for conducting comparative experiments. However, Atif et al. (2021) performed a simulation study to analyze the performance of the framework for various clustering parameters.

Comment # 4: More important articles should be cited.

Response: Added to the manuscript with track changes.

Comment # 5: More standard data sets from UCI should be checked.

Response: Thank you for pointing this out; I agree with this comment. Therefore, I have added some standard datasets from the UCI machine learning repository. We added the Intel Berkeley Research Lab Sensor Data, used by many research articles such as Doreswamy and Narasegouda (2014), Baralis, Cerquitelli and D'Elia (2007), and Baralis, Cerquitelli and D'Elia (2022), to the manuscript.

Furthermore I added a citation of the paper that discusses the applications of monitoring changes in clustering solutions using some standard datasets.

Comments from Reviewer # 2:

Comment # 1: There is no need to mention t1, t2, ..., tn in the abstract. Please consider removing it.

Response: Corrected accordingly.

Comment # 2: In the Abstract, it is mentioned that "The contribution of this paper is to demonstrate the implementation of the package using synthetic and real-life datasets in R software." This sentence gives the impression that this paper is demonstrating someone else's package. However, the package's author is the first author of the manuscript. Please revise this sentence.

Response: The statement is rephrased in the manuscript with track changes.

Comment # 3: Please do not give mathematical definitions in the Introduction. Instead, you should discuss the necessity of the package and such a paper to describe it, mention the contributions and finish with an outline paragraph. You can move the details of the method that the package implements to the next section.

Response: All the mathematical definitions and methods used in the package are discussed in the section “Change detection algorithms”. These definitions and mathematical descriptions are removed from the “Introduction” section in the updated manuscript.

Comment # 4: Please link the function descriptions of the package to the mathematical definition of the methods implemented by the package.

Response: Corrected accordingly.

Comment # 5: It looks like the help documentation of the package is repeated in the manuscript. Since all those points are already given in the documentation, repeating them is unnecessary. Instead, please discuss how to specify the inputs and how to use the outputs in relation to the methods.

Response: Corrected in the manuscript with track changes.

Comment # 6: Manuscript needs to be checked against English language issues.

Response: Corrected in the manuscript with track changes.

Comment # 7: It is mentioned that "we generate a data stream sprouting at four consecutive time points". Please elaborate on the generation of data.

Response: The package takes temporal datasets as input and re-clusters them at successive time points. A temporal dataset is not stationary; rather, it evolves over time. It has a dedicated attribute, known as the time-stamp, which records the arrival time of each data record. The stream of data records is discretized by accumulating it at discrete time points denoted by t1, t2, ..., tn.
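The accumulation step described in this response can be illustrated with a minimal Python sketch. It is a language-neutral illustration, not the clusTransition API: the function name cumulative_snapshots, the toy stream, and the integer time-stamps are all hypothetical. It assumes that each record carries a time-stamp and that the snapshot at time point t_i contains every record that arrived up to and including t_i.

```python
from bisect import bisect_right


def cumulative_snapshots(records, time_points):
    """Return one cumulative dataset per time point.

    records     : list of (timestamp, features) pairs, in any order
    time_points : the discrete cut points t1, t2, ..., tn
    The snapshot for t_i holds every record whose timestamp is <= t_i.
    """
    # Order the stream by arrival time (hypothetical integer time-stamps).
    ordered = sorted(records, key=lambda rec: rec[0])
    stamps = [rec[0] for rec in ordered]
    snapshots = []
    for t in sorted(time_points):
        # bisect_right counts the records that arrived up to and including t.
        snapshots.append(ordered[:bisect_right(stamps, t)])
    return snapshots


# Hypothetical stream of (timestamp, feature vector) records.
stream = [
    (1, (0.2, 1.1)),
    (3, (0.4, 0.9)),
    (2, (5.0, 4.8)),
    (4, (5.2, 5.1)),
    (4, (0.3, 1.0)),
]

snapshots = cumulative_snapshots(stream, time_points=[1, 2, 3, 4])
sizes = [len(s) for s in snapshots]
print(sizes)
```

Each snapshot would then be clustered separately, and the clusterings at consecutive time points compared to detect transitions.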

Comment # 8: It is mentioned that "... we generate a data stream sprouting at four consecutive time points." But in the next sections, there are only two applications: "Application to Human values scale" and "Application to Individual Household Electric Power Consumption."

Response: We have generated a synthetic temporal dataset that evolves at four consecutive time points. The four time points refer to the time-stamps at which data records with the same attributes emerged. They do not indicate different datasets. The applications, i.e. "Application to Human values scale" and "Application to Individual Household Electric Power Consumption", are also temporal datasets, which emerged at five time-stamps.

• Doreswamy, Narasegouda, S. (2014). Fault Detection in Sensor Network Using DBSCAN and Statistical Models. In: Satapathy, S., Udgata, S., Biswal, B. (eds) Proceedings of the International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA) 2013. Advances in Intelligent Systems and Computing, vol 247. Springer, Cham. https://doi.org/10.1007/978-3-319-02931-3_50

• E. Baralis, T. Cerquitelli and V. D'Elia, "Modeling a Sensor Network by means of Clustering," 18th International Workshop on Database and Expert Systems Applications (DEXA 2007), 2007, pp. 177-181, doi: 10.1109/DEXA.2007.23.

• Baralis, Elena & Cerquitelli, Tania & D’Elia, Vincenzo. (2022). Technical Report Modeling a Sensor Network by means of Clustering.

• M. Atif. (2021). Monitoring changes in cluster solutions. (Doctoral dissertation). Available from FIS of University of Natural Resources and Life Sciences, Vienna.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Mohammad Mehdi Rashidi

12 Oct 2022

PONE-D-22-20071R1

clusTransition: An R Package for Monitoring Transition in Cluster Solutions of Temporal Datasets

PLOS ONE

Dear Dr. Atif,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 26 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Mohammad Mehdi Rashidi

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The author has responded to all my comments in the previous review round sufficiently. I have one more minor comment. Please consider merging the "Related works" section into the introduction since this section is too small to be a stand-alone section in the manuscript.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 15;17(12):e0278146. doi: 10.1371/journal.pone.0278146.r004

Author response to Decision Letter 1


15 Oct 2022

Reviewer #2: The author has responded to all my comments in the previous review round sufficiently. I have one more minor comment. Please consider merging the "Related works" section into the introduction since this section is too small to be a stand-alone section in the manuscript.

Response: The "Related works" section has been merged into the Introduction section. The changes are highlighted in the revised manuscript with track changes.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Mohammad Mehdi Rashidi

11 Nov 2022

clusTransition: An R Package for Monitoring Transition in Cluster Solutions of Temporal Datasets

PONE-D-22-20071R2

Dear Dr. Atif,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Mohammad Mehdi Rashidi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: In this revision round, the author has responded to all my comments in the previous review round sufficiently.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

Acceptance letter

Mohammad Mehdi Rashidi

17 Nov 2022

PONE-D-22-20071R2

clusTransition: An R Package for Monitoring Transition in Cluster Solutions of Temporal Datasets

Dear Dr. Atif:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Mohammad Mehdi Rashidi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All relevant data are within the paper. All other data streams used in the manuscript are available in public repositories. DOI for Human Value Scale datasets: doi:10.21338/NSD-ESS-CUMULATIVE. Link for Human Value Scale datasets: https://ess-search.nsd.no/CDW/ConceptVariables. Link for Household Electric Power Consumption: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. Link for Intel Lab sensor datasets: https://www.kaggle.com/datasets/divyansh22/intel-berkeley-research-lab-sensor-data.


    Articles from PLOS ONE are provided here courtesy of PLOS
