Abstract
This study develops a new clustering method for high-dimensional zero-inflated time series data. The proposed method is based on the thick-pen transform (TPT), whose basic idea is to draw along the data with a pen of a given thickness. Since the TPT is a multi-scale visualization technique, it provides information on the temporal tendency of neighboring values. We introduce a modified TPT, termed ‘ensemble TPT (e-TPT)’, to enhance the temporal resolution of zero-inflated time series data, which is crucial for clustering them efficiently. Furthermore, this study defines a modified similarity measure for zero-inflated time series data based on the e-TPT and proposes an efficient iterative clustering algorithm suitable for the proposed measure. Finally, the effectiveness of the proposed method is demonstrated by simulation experiments and two real datasets: step count data and newly confirmed COVID-19 case data.
Keywords: Clustering, Multiscale method, Newly confirmed COVID-19 case data, Step count data, Thick-pen transform, Zero-inflated time series data
Introduction
Clustering is a popular unsupervised machine learning technique for identifying patterns and groupings in data, which has been widely used in many domains, including biology, finance, and image processing. However, many real-world datasets, especially those in healthcare, finance, and environmental monitoring, often exhibit zero inflation, which refers to excessive zeros in the data. This characteristic poses significant challenges to traditional clustering algorithms, which assume that the data points follow a specific distribution or pattern. In particular, zero-inflated time series data are prevalent in many domains, such as disease surveillance and financial transaction analyses. For instance, in epidemiology, the counts of infectious diseases are often zero-inflated due to under-reporting, misclassification, and other factors. Various methods have been proposed to address the challenges of clustering zero-inflated time series data, including a zero-inflated Gaussian mixture model (Zhang et al., 2020), a zero-inflated Poisson mixture model (Lim et al., 2014), and a zero-inflated negative binomial mixture model (Yau et al., 2003). These methods aim to model the zero-inflation by adding extra parameters or components to the mixture model and effectively cluster zero-inflated time series data.
In this study, we propose a new clustering method without specific model assumptions that can be applied to various structures of zero-inflated time series data. Selecting an appropriate distance (similarity) measure is essential in time series clustering. Thus, we propose a similarity measure suitable for zero-inflated time series data, inspired by the thick-pen transform (TPT) of Fryzlewicz & Oh (2011). The TPT is a way of visualizing time series data at multiple scales using a range of pens with various thicknesses. To improve the temporal resolution of zero-inflated time series data, which is crucial for efficient clustering, we introduce two modifications: the ensemble TPT (e-TPT) and a modified version of the thick-pen measure of association (TPMA). These approaches have the advantage of retaining some original properties of the TPT, namely capturing the trends of neighboring data points and reflecting the multi-scale information of the data. Then, we present a clustering algorithm based on the proposed similarity measure.
The primary rationale of the proposed method is that e-TPT can effectively manage the issue of excessive zeros in zero-inflated time series data. To demonstrate this, we present two zero-inflated time series in Fig. 1, where the proportions of zero observations are 0.495 and 0.480. We apply e-TPT with a square pen, as explained in Section 2.1, and obtain the upper boundary of the pen. The lower boundary of the e-TPT for zero-inflated time series data rarely fluctuates; thus, we only consider the upper boundary of the pen. The upper boundary from the pen with a thickness of 100 reveals the global trend of each series, and the two series are clearly distinguished. Moreover, there is no zero observation in the upper boundary of the e-TPT, indicating that a simple clustering method may work well without explicitly handling the problem of excessive zeros.
Fig. 1.
From left to right, two simulated zero-inflated time series, their e-TPT with a square pen with a thickness of 5, and upper boundaries obtained from the e-TPT with thicknesses of 5 and 100
This study is motivated by two real-world time series. The first comprises data on the number of steps recorded from wearable devices. Figure 2 depicts the step data recorded for three days. As expected, zero values occur frequently, and daily activity patterns are observed. A proper clustering of step data can provide rich information about physical activities and can be further used for personal healthcare services.
Fig. 2.
Step count data for three different days
We consider newly confirmed coronavirus disease 2019 (COVID-19) cases per day in Seoul, Korea, as the second time series dataset. South Korea had its first confirmed COVID-19 case in January 2020. As of February 2022, the cumulative number of confirmed cases was more than 2,665,000. Figure 3 illustrates the number of new COVID-19 cases per day in three districts in Seoul from February 5, 2020, to June 18, 2021. Before November 2020, few new cases of COVID-19 were confirmed in all three districts, but the number of confirmed cases suddenly increased in the winter of 2020. The proportions of days with zero confirmed cases in the three districts are 51.6%, 31.6%, and 50%, respectively. This data analysis aims to observe the time series patterns of confirmed cases that vary from district to district and to cluster the 25 districts in Seoul based on the patterns of confirmed COVID-19 cases per day. Many recent COVID-19-related studies have modeled the number of deaths or confirmed cases using zero-inflated time series models. For example, Tawiah et al. (2021) analyzed the trend of the daily count of COVID-19 deaths in Ghana using a zero-inflated Poisson autoregressive model and a zero-inflated negative binomial autoregressive model.
Fig. 3.
Newly confirmed COVID-19 cases per day in three districts of Seoul from February 5, 2020 to June 18, 2021
The remainder of this paper is organized as follows. Section 2 introduces the e-TPT and proposes a new similarity measure based on it. In addition, the proposed clustering method and its practical algorithm are presented. In Section 3, a simulation study is conducted to evaluate the empirical performance of the proposed method. Next, Section 4 discusses the real-data analysis with two real datasets: step count data and newly confirmed COVID-19 case data. The concluding remarks are provided in Section 5.
Proposed Clustering Procedure
Ensemble Thick-Pen Transform
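The TPT is based on the idea of drawing along time series data points with a pen of a given shape and thickness. We let $\mathcal{T} = \{\tau_1, \ldots, \tau_{|\mathcal{T}|}\}$ denote a set of thickness parameters. The TPT of a real-valued univariate process $X_t$ is defined as the following sequence of boundary pairs:
$$TP_{\mathcal{T}}(X_t) = \left\{ \left( L_t^{\tau_i}(X),\, U_t^{\tau_i}(X) \right) \right\}_{i = 1, \ldots, |\mathcal{T}|},$$
where $L_t^{\tau}(X)$ and $U_t^{\tau}(X)$ represent the lower and upper boundaries of the area covered by a pen of thickness $\tau$ at time t, respectively. As for the pen shape, Fryzlewicz & Oh (2011) considered square and round shapes as follows.
- Square pen
$$L_t^{\tau}(X) = \min(X_t, X_{t+1}, \ldots, X_{t+\tau}) - \frac{\gamma\tau}{2}, \qquad U_t^{\tau}(X) = \max(X_t, X_{t+1}, \ldots, X_{t+\tau}) + \frac{\gamma\tau}{2}.$$
- Round pen
$$L_t^{\tau}(X) = \min_{k \in [-\tau/2,\, \tau/2] \cap \mathbb{Z}} \left\{ X_{t+k} - \gamma \sqrt{\frac{\tau^2}{4} - k^2} \right\}, \qquad U_t^{\tau}(X) = \max_{k \in [-\tau/2,\, \tau/2] \cap \mathbb{Z}} \left\{ X_{t+k} + \gamma \sqrt{\frac{\tau^2}{4} - k^2} \right\}.$$
Above, $\mathbb{Z}$ denotes the set of integers, and $\gamma$ represents the scaling factor defined to adjust the difference between the thickness of the pen and the data variability. Following Fryzlewicz & Oh (2011), $\gamma$ is kept at its default value unless otherwise stated.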
The TPT has a multi-scale feature of viewing data at different distances according to the thickness of the pen. Specifically, applying large thickness values corresponds to zooming out and coarsely viewing data trends, whereas small thickness values sensitively capture the original features. Further, this transformation is visually intuitive and informative. Figure 4(a) and (b) display the boundaries obtained by applying a square pen with thicknesses of 30 and 80, respectively, to the step data recorded on a specific day. The data trend with a thickness of 80 is coarser than that with a thickness of 30.
Fig. 4.
(a) and (b) TPT with thicknesses of 30 and 80. (c) and (d) e-TPT with thicknesses of 30 and 80. The original step data are plotted with a solid black line
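To make the square-pen construction concrete, the following is a minimal Python sketch. It assumes the forward-window form of the square pen (minimum and maximum over $X_t, \ldots, X_{t+\tau}$, shifted by $\gamma\tau/2$), truncates windows at the end of the series, and uses `gamma=1.0` purely as a placeholder default; the function name `square_pen_tpt` is hypothetical and does not come from the authors' R code.

```python
from typing import List, Sequence, Tuple


def square_pen_tpt(
    x: Sequence[float], tau: int, gamma: float = 1.0
) -> Tuple[List[float], List[float]]:
    """Lower and upper boundaries of a square pen of thickness tau.

    Assumed form (a sketch, not the authors' implementation):
        L_t = min(x[t], ..., x[t + tau]) - gamma * tau / 2
        U_t = max(x[t], ..., x[t + tau]) + gamma * tau / 2
    Windows that run past the end of the series are truncated.
    """
    n = len(x)
    lower: List[float] = []
    upper: List[float] = []
    for t in range(n):
        window = x[t:min(t + tau + 1, n)]
        lower.append(min(window) - gamma * tau / 2)
        upper.append(max(window) + gamma * tau / 2)
    return lower, upper


# Toy zero-inflated series: the upper boundary never touches zero.
series = [0, 0, 3, 0, 0, 0, 5, 0]
lower, upper = square_pen_tpt(series, tau=2)
```

A larger `tau` widens both the window and the vertical offset, which is what produces the coarser, zoomed-out view described above.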
In this study, we consider a variation of the TPT to obtain a smooth version of the thick-pen boundaries, enhancing the temporal resolution of the time-series data and making the proposed clustering performance more effective. Thus, we define the upper and lower ensemble boundaries of a real-valued univariate process with a square pen with a thickness of as
which are the ensemble means of boundaries created with different starting points s. Thus, the ensemble TPT (e-TPT) of is defined as the sequence of pairs of the ensemble boundaries,
Using the average value of the boundaries, the ensemble TPT provides smoother boundaries than the conventional TPT and is less sensitive to the initial data values and outliers. Figure 4(c) and (d) illustrate the ensemble boundaries with thicknesses of 30 and 80, where the boundaries are much smoother than the conventional ones in panels (a) and (b).
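The smoothing effect of the ensemble can be illustrated with a short Python sketch. The exact averaging range is an assumption here: the sketch averages the square-pen upper boundaries over every window start $s$ whose window covers time $t$ (that is, $s = t - \tau, \ldots, t$, truncated at the series edges). The function name `ensemble_upper_boundary` and the default `gamma=1.0` are hypothetical choices for illustration.

```python
from typing import List, Sequence


def ensemble_upper_boundary(
    x: Sequence[float], tau: int, gamma: float = 1.0
) -> List[float]:
    """Ensemble mean of square-pen upper boundaries (illustrative sketch)."""
    n = len(x)
    # Square-pen upper boundary for each window start s (truncated at the end).
    upper = [max(x[s:min(s + tau + 1, n)]) + gamma * tau / 2 for s in range(n)]
    smoothed: List[float] = []
    for t in range(n):
        # ASSUMPTION: average over all starts s with s <= t <= s + tau.
        starts = range(max(0, t - tau), t + 1)
        smoothed.append(sum(upper[s] for s in starts) / len(starts))
    return smoothed


series = [0, 0, 3, 0, 0, 0, 5, 0]
smooth_upper = ensemble_upper_boundary(series, tau=2)
```

Because each value is an average of boundaries that are all at least `gamma * tau / 2`, the ensemble upper boundary stays strictly positive for nonnegative data, which matters for the log transformation used later in the clustering step.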
Similarity Measure for Clustering
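This section proposes a similarity measure employed as the input variable for clustering zero-inflated time series data. For this purpose, we consider the thick-pen measure of association (TPMA) between the two time series X(t) and Y(t) proposed by Fryzlewicz & Oh (2011). Suppose that X(t) and Y(t) are on approximately the same scale, and let $L_t^{\tau}(\cdot)$ and $U_t^{\tau}(\cdot)$ denote the lower and upper pen boundaries at thickness $\tau$. The TPMA is then defined as
$$\rho_t^{\tau}(X, Y) = \frac{\min\{U_t^{\tau}(X), U_t^{\tau}(Y)\} - \max\{L_t^{\tau}(X), L_t^{\tau}(Y)\}}{\max\{U_t^{\tau}(X), U_t^{\tau}(Y)\} - \min\{L_t^{\tau}(X), L_t^{\tau}(Y)\}}. \tag{1}$$
Moreover, $\rho_t^{\tau}(X, Y) \in (-1, 1]$, and $\rho_t^{\tau}(X, Y) \ge 0$ holds when an overlap exists between the two boundaries, whereas $\rho_t^{\tau}(X, Y) < 0$ when a gap exists between the two boundaries. This idea of measuring time series dependence based on the overlap or gap of pen areas is intuitively recognized through the visualization of transformations.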
To reflect the characteristics of zero-inflated time series data, we propose a new similarity measure based on e-TPT and TPMA. From now on, we assume that the given time series data are nonnegative and zero-inflated. Then, the lower boundary of e-TPT for zero-inflated time series data rarely fluctuates. Therefore, it is natural to modify the TPMA measure of (1) by setting the lower boundary of the pen to zero. Writing $\tilde{U}_t^{\tau}(X)$ for the upper ensemble boundary of $X$ at thickness $\tau$, the modified TPMA measure, $\widetilde{\mathrm{TPMA}}$, is defined as
$$\tilde{\rho}_t^{\tau}(X, Y) = \frac{\min\{\tilde{U}_t^{\tau}(X), \tilde{U}_t^{\tau}(Y)\}}{\max\{\tilde{U}_t^{\tau}(X), \tilde{U}_t^{\tau}(Y)\}} \tag{2}$$
for each time $t$, and its geometric mean over time is proposed as a measure to assess the similarity between two time series. Here, $\tilde{\rho}_t^{\tau}(X, Y)$ measures the length of the intersection of $[0, \tilde{U}_t^{\tau}(X)]$ and $[0, \tilde{U}_t^{\tau}(Y)]$ as a proportion of the length of the union of these two intervals. Thus, $0 < \tilde{\rho}_t^{\tau}(X, Y) \le 1$ holds for every $t$. This measure returns a value close to 1 when the two time series are similar at time t. Note that the e-TPT transformation can affect the ratio through the pen thickness. For example, the ratio is less affected when the pen is relatively thin, but the ratio can vary significantly when the pen is relatively thick compared to the data values.
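The ratio described above (intersection over union of two intervals that both start at zero) reduces to a minimum divided by a maximum. A minimal Python sketch follows, assuming the inputs are strictly positive upper ensemble boundaries of equal length; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def modified_tpma(upper_x: Sequence[float], upper_y: Sequence[float]) -> List[float]:
    """Pointwise ratio min(Ux, Uy) / max(Ux, Uy) for positive upper boundaries."""
    if len(upper_x) != len(upper_y):
        raise ValueError("both boundary sequences must have the same length")
    return [min(a, b) / max(a, b) for a, b in zip(upper_x, upper_y)]


def geometric_mean_similarity(
    upper_x: Sequence[float], upper_y: Sequence[float]
) -> float:
    """Geometric mean over time of the pointwise ratios."""
    ratios = modified_tpma(upper_x, upper_y)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


ux = [4.0, 4.0, 2.0]
uy = [2.0, 4.0, 8.0]
ratios = modified_tpma(ux, uy)                    # [0.5, 1.0, 0.25]
similarity = geometric_mean_similarity(ux, uy)    # (0.5 * 1.0 * 0.25) ** (1 / 3)
```

Taking logarithms turns each ratio into a negative absolute difference of log boundaries, which is why the geometric mean pairs naturally with an $L_1$-type distance.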
Figure 5 presents the procedure for computing the modified TPMA for two step count time series. Panels (a) and (b) display the e-TPT results of the two datasets using a square pen with a thickness of 30, and panel (c) shows the overlapping areas (purple) of the two e-TPT results. Finally, the result of the similarity measure of (2) is presented in panel (d). The measure is low when little overlap occurs between the two e-TPTs, whereas it is close to 1 when a considerable overlap exists. For comparison, we also present the TPMA values of (1) based on TPT with a square pen in panel (e) and the TPMA values of (1) based on e-TPT in panel (f). Both the TPMA and the modified TPMA reflect the similarity between the two time series well. However, using (2), we can obtain more straightforward criteria for clustering, which is discussed in Section 2.3.
Fig. 5.
(a) and (b) e-TPT results using the square pen with a thickness of 30 for two step data. (c) Overlapping areas (purple) of the two e-TPT results. (d) Modified TPMA values of (2) based on e-TPTs. (e) TPMA values based on TPTs. (f) TPMA values based on e-TPTs
Clustering Procedure Based on TPMA
The goal is to determine K optimal partitions of a set of observations $\mathcal{X} = \{x_1, \ldots, x_N\}$, where each $x_i$ belongs to a domain set E. We let $P = \{P_1, \ldots, P_K\}$ be a set of K partitions of the data that satisfies $\bigcup_{k=1}^{K} P_k = \mathcal{X}$ and $P_k \cap P_l = \emptyset$ for $k \neq l$. We set $M = \{m_1, \ldots, m_K\}$ as a set of cluster prototypes.
Given a distance function d, we define the clustering problem as minimizing the following cost function,
$$C(P, M) = \sum_{k=1}^{K} \sum_{x_i \in P_k} d(x_i, m_k). \tag{3}$$
This optimization is carried out by iterating two steps:
- Update P: Given a set of cluster prototypes M, update P by assigning each $x_i$ to $P_k$ with $k = \arg\min_{l} d(x_i, m_l)$.
- Update M: Given a partition P, update M with $m_k = \arg\min_{m \in E} \sum_{x_i \in P_k} d(x_i, m)$.
The cost function does not increase at any iteration step. The well-known K-means algorithm (Hartigan & Wong, 1979) uses the squared $L_2$ distance, leading to the mean of each component as the cluster prototype when $E$ is a Euclidean space. Furthermore, the $L_1$ distance function yields the K-medians algorithm, which uses the component-wise medians as cluster prototypes (Leisch, 2006).
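Suppose that we have multiple zero-inflated time series $X_i$, $i = 1, \ldots, N$. We obtain the corresponding upper boundaries $\tilde{U}_t^{\tau}(X_i)$ of $X_i$ by e-TPT using a square pen with a thickness of $\tau$, for $t = 1, \ldots, T$ and $i = 1, \ldots, N$. We compute the similarity measure of (2) between any two time series $X_i$ and $X_j$ ($i \neq j$) and take the log function. The measure can be further expressed as
$$\log \tilde{\rho}_t^{\tau}(X_i, X_j) = \log \min\{\tilde{U}_t^{\tau}(X_i), \tilde{U}_t^{\tau}(X_j)\} - \log \max\{\tilde{U}_t^{\tau}(X_i), \tilde{U}_t^{\tau}(X_j)\} = -\left| \log \tilde{U}_t^{\tau}(X_i) - \log \tilde{U}_t^{\tau}(X_j) \right|.$$
Given a partition $P$, we let $c(i)$ be the cluster group label of $X_i$, and $m_{c(i)}$ be the cluster prototype of the group to which $X_i$ belongs. Then, we maximize the geometric mean of the proposed similarity measure over each time t and element i, which is equivalent to minimizing the sum of $L_1$ distances with respect to the logarithms of the upper boundaries,
$$\sum_{i=1}^{N} \sum_{t=1}^{T} \left| \log \tilde{U}_t^{\tau}(X_i) - \log \tilde{U}_t^{\tau}(m_{c(i)}) \right|.$$
In other words, given a partition and the cluster prototypes, we have the following cost function to be minimized,
$$C_{\tau}(P, M) = \sum_{i=1}^{N} \sum_{t=1}^{T} \left| \log \tilde{U}_t^{\tau}(X_i) - \log \tilde{U}_t^{\tau}(m_{c(i)}) \right|. \tag{4}$$
This problem is an optimization over the logarithms of the upper boundaries $\log \tilde{U}_t^{\tau}(X_i)$. Thus, applying the K-medians algorithm to this set ensures a monotonic decrease in the cost function.
As $0 < \tilde{\rho}_t^{\tau} \le 1$ holds for every $t$, its log-transformation is problematic as $\tilde{\rho}_t^{\tau}$ approaches zero. However, the thickness of a pen guarantees a minimal value of the upper boundaries that is sufficiently greater than zero. Thus, we assume that $\epsilon > 0$ exists such that $\tilde{\rho}_t^{\tau} \ge \epsilon$ for any $t$, as long as the upper boundaries of the transformed data are bounded above.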
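The two-step iteration applied to log upper boundaries can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' Algorithm 1: the initial prototypes are picked by index through the hypothetical `init` argument, empty clusters simply keep their previous prototype, and the toy upper boundaries are made up.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(p - q) for p, q in zip(a, b))


def k_medians(
    rows: Sequence[Sequence[float]], init: Sequence[int], max_iter: int = 100
) -> Tuple[List[int], List[List[float]]]:
    """K-medians under the L1 distance; rows are log upper boundaries."""
    prototypes = [list(rows[i]) for i in init]
    labels = [-1] * len(rows)
    for _ in range(max_iter):
        # Update P: assign every series to its nearest prototype.
        new_labels = [
            min(range(len(prototypes)), key=lambda k: l1_distance(r, prototypes[k]))
            for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update M: the coordinate-wise median minimizes the L1 cost per cluster.
        for k in range(len(prototypes)):
            members = [rows[i] for i, lab in enumerate(labels) if lab == k]
            if members:  # an empty cluster keeps its previous prototype
                prototypes[k] = [median(col) for col in zip(*members)]
    return labels, prototypes


# Hypothetical upper ensemble boundaries for four short series.
upper = [
    [1.0, 1.0, 4.0, 4.0],
    [1.0, 1.2, 4.0, 5.0],
    [4.0, 4.0, 1.0, 1.0],
    [5.0, 4.0, 1.0, 1.1],
]
log_rows = [[math.log(u) for u in row] for row in upper]
labels, prototypes = k_medians(log_rows, init=[0, 2])
```

In practice the result depends on the initial prototypes, so several random restarts would normally be compared by their final cost.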
Practical Algorithm
The entire clustering scheme can be summarized by Algorithm 1. Suppose that we have N zero-inflated nonnegative time series, $X_1, \ldots, X_N$. We assume that all time series have the same scale, and that the number of cluster groups K and the thickness $\tau$ are given.
The following are some remarks on the algorithm.
- Cost function (4) can be viewed as an optimization problem for the set of logarithms of the upper boundaries. Therefore, we apply the K-medians algorithm to this set and select the cluster prototype as the median of the logarithmic values, as in Step 4. It is worth noting that the corresponding cluster prototype $m_k$ in (3) can be defined as any series whose log upper boundary equals this median, which is not unique for each $k$. However, the clustering algorithm works only with the logarithms of the upper boundaries and does not require identifying $m_k$.
- This study considers various thickness values ($\tau$) for a multiscale interpretation of the results. Applying a thick pen tends to view data from a distance, focusing on significant trends; thus, the proposed clustering method divides the data based on global trends. In contrast, a small thickness ($\tau$) tends to capture the pattern sensitively, and the corresponding clustering results reflect the detailed data pattern. However, in some cases, an optimal thickness ($\tau$) must be determined to obtain a single clustering result, and the cross-validation (CV) technique can be used to select it. More specifically, Algorithm 1 is applied to training data, and the cluster prototypes are obtained. Then, the cluster group label of a test series $X_j$ is determined as
$$\hat{c}(j) = \arg\min_{k} \sum_{t=1}^{T} \left| \log \tilde{U}_t^{\tau}(X_j) - \log \tilde{U}_t^{\tau}(m_k) \right|.$$
The cross-validated error is defined as
$$\mathrm{CV}(\tau) = \frac{1}{N_{\mathrm{test}}} \sum_{j=1}^{N_{\mathrm{test}}} I\left( \hat{c}(j) \neq c(j) \right),$$
where $N_{\mathrm{test}}$ is the number of time series in the test data set, $c(j)$ represents the true cluster group label of $X_j$, and I denotes the indicator function. A cluster validity index, such as the Dunn index (Pakhira et al., 2004) or Silhouette index (Shutaywi & Kachouie, 2021), may be used if the actual cluster groups are unknown. To determine the number of clusters K, we use the gap statistic of Tibshirani et al. (2001).
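The cross-validated error for a single candidate thickness can be sketched as a nearest-prototype assignment followed by a misclassification rate. The sketch below assumes that prototype labels are already aligned with the true labels and that rows hold log upper boundaries; the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(p - q) for p, q in zip(a, b))


def assign_label(row: Sequence[float], prototypes: Sequence[Sequence[float]]) -> int:
    """Index of the prototype closest to `row` under the L1 distance."""
    return min(range(len(prototypes)), key=lambda k: l1_distance(row, prototypes[k]))


def cv_error(
    test_rows: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    prototypes: Sequence[Sequence[float]],
) -> float:
    """Share of test series whose assigned label differs from the true label."""
    predicted: List[int] = [assign_label(r, prototypes) for r in test_rows]
    wrong = sum(p != c for p, c in zip(predicted, true_labels))
    return wrong / len(test_rows)


prototypes = [[0.0, 0.0], [2.0, 2.0]]      # fitted on training folds
test_rows = [[0.1, 0.0], [1.9, 2.2], [0.4, 0.3], [1.6, 1.7]]
true_labels = [0, 1, 1, 1]
error = cv_error(test_rows, true_labels, prototypes)
```

Repeating this over folds and over a grid of thickness values, then keeping the thickness with the smallest average error, gives one way to carry out the selection described above.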
Simulation Study
This section conducts a simulation study to evaluate the empirical performance of the proposed method. For this purpose, we consider four types of zero-inflated time series data. The true number of clusters, K, is assumed to be known in all cases. The reproducible R code for simulation studies is provided at https://github.com/mjkim1001/ZITS.
Models for Simulation Data
Model 1: Nonstationary autoregressive model with abruptly changing parameters
This model was first considered by Fryzlewicz & Ombao (2009) for a classification problem. We modified the model slightly to have a zero-inflated time series structure and use it for clustering. The ith time series data from group g, denoted as , is generated from
| 5 |
where and i.i.d. N(0, 1). The time-varying parameters and are defined as in Table 1, where are different at . We generated time series from each group, and two sample time series with from each group are presented in Fig. 6. The average zero ratio of 100 time series is 0.501.
Table 1.
Time-varying parameters in Model 1
| Time-varying parameters | Time index | Group 1 | Group 2 |
|---|---|---|---|
| | | 0.8 | 0.8 |
| | | -0.9 | 1.6 |
| | | 0.8 | 0.8 |
| | | -0.81 | -0.81 |
Fig. 6.
Sample time series with generated from Model 1. The red vertical lines indicate the change points, and 128
Model 2: Nonstationary AR model with slowly changing parameters
We generated two cases of data from a nonstationary AR model with slowly changing parameters. Thus, we used (5) with different forms: for , and i.i.d. N(0, 1) ,
Case (a)
Case (b)
The average zero ratios for both cases are 0.5. The sample time series data from each group for both cases are illustrated in Fig. 7.
Fig. 7.
Sample time series generated from Model 2 with
Model 3: Block data with different patterns
We considered a noisy block time series with four different patterns. To generate the time series, we reused (5) with the following , as
where , . In addition, satisfies , , , and , whose values are related to the height of each vertical jump, and is generated from , for . The average zero ratio of data from the above model is 0.494. The sample block time series data from each group are presented in Fig. 8.
Fig. 8.
Sample time series generated from Model 3 with
Model 4: ZIP model with different mean
We considered a time series , , and , from group . The ith data in group g are generated from a zero-inflated Poisson model,
where is the expected Poisson count generated from , , and , and . The zero-inflation parameter, is generated from Unif(0.4, 0.7). The average zero ratio from the generated data set is 0.583. Figure 9 displays the sample time series from two groups with .
Fig. 9.
Sample time series generated from Model 4 with and
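For readers who want to experiment with data of this kind, the following Python sketch draws one zero-inflated Poisson series. Only the Unif(0.4, 0.7) range for the zero-inflation parameter comes from the text; the Poisson mean `lam=5.0`, the series length, and the seed are hypothetical placeholders, and the sampler uses Knuth's multiplication method, which is adequate for small means.

```python
import math
import random
from typing import List


def poisson_draw(rng: random.Random, lam: float) -> int:
    """One Poisson(lam) draw via Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def zip_series(length: int, lam: float, pi: float, rng: random.Random) -> List[int]:
    """Zero with probability pi, otherwise a Poisson(lam) count."""
    return [0 if rng.random() < pi else poisson_draw(rng, lam) for _ in range(length)]


rng = random.Random(0)
pi = rng.uniform(0.4, 0.7)          # zero-inflation parameter, as in the text
series = zip_series(length=2000, lam=5.0, pi=pi, rng=rng)
zero_ratio = series.count(0) / len(series)
expected_zero_ratio = pi + (1 - pi) * math.exp(-5.0)
```

The observed zero ratio is slightly above `pi` because the Poisson component also produces zeros with probability $e^{-\lambda}$.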
For comparison, we considered three existing functional and time series clustering methods:
FunFEM – Functional clustering based on discriminative functional mixture modeling by Bouveyron et al. (2015). We use the default criterion in the R package “funFEM”.
FunHDDC – Functional clustering based on the functional latent mixture modeling by Schmutz et al. (2018). We use the BIC to select the best model, and other hyper-parameters are set using default values in the R package “funHDDC”.
DTW – Time-series clustering based on the dynamic time warping (DTW) distance by Wang et al. (2018), which is implemented using the R package “dtwclust” by Sarda-Espinosa (2022).
Simulation Results
For the evaluation measures, we used the correct classification rate (CCR; %) and the adjusted Rand index (aRand) of Hubert & Arabie (1985). The aRand is a modified version of the Rand index (Rand, 1971), adjusted to have an expected value of 0 under random labeling and an upper bound of 1. It measures the correspondence between two partitions by classifying object pairs in a contingency table, and a higher value of the aRand index indicates a higher similarity between the two partitions. Table 2 summarizes the evaluation measures computed over 100 simulations.
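Both measures can be computed from label vectors alone, as the following Python sketch shows. The adjusted Rand index follows the standard contingency-table formula of Hubert & Arabie (1985). The CCR shown here maximizes agreement over all relabelings of the predicted clusters, which is an assumption about how cluster labels are matched to the true groups and is practical only for a small number of clusters.

```python
from collections import Counter
from itertools import permutations
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Adjusted Rand index; undefined when both partitions are trivial."""
    n = len(a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (maximum - expected)


def correct_classification_rate(true: Sequence[int], pred: Sequence[int]) -> float:
    """Best agreement over all relabelings of the predicted clusters."""
    names = sorted(set(true) | set(pred))
    best = 0
    for perm in permutations(names):
        mapping = dict(zip(names, perm))
        best = max(best, sum(mapping[p] == t for p, t in zip(pred, true)))
    return best / len(true)


true = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
ari = adjusted_rand_index(true, pred)          # (4 - 2.8) / (6.5 - 2.8)
ccr = correct_classification_rate(true, pred)  # 5 of 6 after relabeling
```

The example illustrates why the two measures can diverge: five of six labels agree after relabeling, yet the chance-corrected pair agreement is only about 0.32.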
Table 2.
Means and standard deviations (in parentheses) of the correct classification rate (CCR) and adjusted Rand index (aRand) values
| | | Proposed TPT clustering | Proposed TPT clustering | Proposed TPT clustering | funFEM | funHDDC | DTW |
|---|---|---|---|---|---|---|---|
| Model 1 | | | | | | | |
| () | CCR | 0.833(0.03) | 0.847(0.029) | 0.851(0.027) | 0.853(0.027) | **0.871(0.028)** | 0.842(0.133) |
| | aRand | 0.445(0.079) | 0.483(0.079) | 0.493(0.076) | 0.463(0.077) | **0.552(0.083)** | 0.535(0.264) |
| () | CCR | 0.812(0.064) | 0.842(0.042) | **0.849(0.026)** | 0.808(0.035) | 0.824(0.036) | 0.801(0.141) |
| | aRand | 0.403(0.115) | 0.473(0.090) | **0.488(0.072)** | 0.383(0.087) | 0.422(0.092) | 0.439(0.264) |
| () | CCR | 0.775(0.087) | 0.820(0.074) | **0.844(0.029)** | 0.798(0.034) | 0.800(0.039) | 0.670(0.145) |
| | aRand | 0.329(0.150) | 0.429(0.137) | **0.473(0.081)** | 0.357(0.082) | 0.364(0.092) | 0.197(0.211) |
| Model 2 | | | | | | | |
| (a) | CCR | **0.852(0.028)** | 0.845(0.043) | 0.84(0.026) | 0.792(0.047) | 0.791(0.044) | 0.623(0.088) |
| | aRand | **0.496(0.078)** | 0.480(0.091) | 0.463(0.071) | 0.347(0.110) | 0.344(0.101) | 0.089(0.102) |
| (b) | CCR | **0.827(0.029)** | 0.826(0.029) | 0.816(0.030) | 0.758(0.045) | 0.756(0.048) | 0.593(0.076) |
| | aRand | **0.427(0.075)** | 0.425(0.077) | 0.400(0.076) | 0.271(0.093) | 0.268(0.098) | 0.055(0.071) |
| Model 3 | | | | | | | |
| | CCR | 0.901(0.090) | 0.893(0.098) | **0.905(0.063)** | 0.821(0.142) | 0.895(0.063) | 0.526(0.067) |
| | aRand | 0.781(0.152) | 0.775(0.144) | **0.782(0.106)** | 0.651(0.205) | 0.762(0.107) | 0.272(0.071) |
| Model 4 | | | | | | | |
| () | CCR | **0.800(0.020)** | 0.799(0.022) | 0.798(0.023) | 0.776(0.021) | 0.779(0.027) | 0.763(0.034) |
| | aRand | **0.359(0.050)** | 0.357(0.053) | 0.355(0.055) | 0.303(0.046) | 0.311(0.06) | 0.279(0.068) |
| () | CCR | **0.801(0.029)** | **0.801(0.029)** | 0.799(0.029) | 0.782(0.03) | 0.778(0.036) | 0.764(0.038) |
| | aRand | **0.363(0.072)** | **0.363(0.071)** | 0.359(0.071) | 0.319(0.068) | 0.312(0.079) | 0.281(0.088) |
Bold face indicates the best performance
In Model 1, the proposed TPT clustering with the third thickness setting in Table 2 and funHDDC provide the best results. At the smallest T, funHDDC works best, but its performance decreases rapidly as T increases. The reduction in accuracy for large T is observed for all methods, but the proposed method with the third thickness setting works well even for the largest T. For Model 2, the proposed methods outperform the other clustering methods in Cases (a) and (b), and the first thickness setting provides the best results. We obtain similar results for Models 3 and 4. In particular, the proposed method with the third thickness setting gives the best results in Model 3, and all proposed clustering results show similar performance in Model 4. The simulation results indicate that the proposed method can improve accuracy compared to existing methods when an appropriate thickness is used. However, the performance of the proposed method depends on the choice of thickness and the underlying model, which may be difficult to determine in practical applications. Overall, the proposed method uses a multiscale strategy for pen thickness to explore clustering results at various scales and demonstrates good clustering performance when a pen thickness suited to the data properties is selected.
As described in Section 2.4, we can use five-fold CV to determine the optimal thickness for e-TPT. Table 3 summarizes the results. In some cases, the CV-selected thickness performs worse than the proposed method with a specific thickness given in Table 2, but it still offers better results than funFEM, funHDDC, and DTW, except for Model 1 with the smallest T.
Table 3.
Cross-validation results from each model
| | Model 1 | Model 1 | Model 1 | Model 2 (a) | Model 2 (b) | Model 3 | Model 4 | Model 4 |
|---|---|---|---|---|---|---|---|---|
| Selected thickness | 41.5(20-100) | 65(30-150) | 74.22(20-150) | 31(10-100) | 37(20-100) | 96(10-150) | 42(10-150) | 40(10-100) |
| CCR | 0.853(0.027) | 0.845(0.043) | 0.837(0.053) | 0.851(0.045) | 0.828(0.038) | 0.904(0.086) | 0.804(0.021) | 0.805(0.029) |
| aRand | 0.499(0.077) | 0.481(0.086) | 0.463(0.107) | 0.498(0.094) | 0.433(0.089) | 0.792(0.132) | 0.370(0.051) | 0.372(0.072) |
Means and standard deviations (in parentheses) of the correct classification rate (CCR) and adjusted rand index (aRand) values, and means and ranges (in parentheses) of the selected thicknesses by CV
Finally, Table 4 summarizes the computation time of each method on the Model 1 dataset. For the Model 1 setting reported in Table 4, the proposed method took an average of 13.39 seconds per simulation on a desktop machine equipped with an Apple M1 Pro 8-core CPU and 16 GB of memory. In the same setting, funFEM took 4.63 seconds, funHDDC took 1.12 seconds, and DTW took 384.63 seconds. The proposed method took longer than funFEM and funHDDC, but it needed only around 22 minutes to run 100 simulations, which is a reasonable computation time compared to that of DTW.
Table 4.
Means and standard deviations (in parentheses) of computation times (sec) per simulation
| | Proposed TPT clustering | funFEM | funHDDC | DTW |
|---|---|---|---|---|
| Model 1 () | 13.39 (2.54) | 4.63 (1.02) | 1.12 (0.72) | 384.63 (39.41) |
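As a quick check of the totals implied by these averages, the per-simulation times scale to 100 simulations as follows (a back-of-the-envelope calculation that ignores run-to-run variation):

```latex
\begin{align*}
\text{Proposed method: } & 100 \times 13.39\,\text{s} = 1339\,\text{s} \approx 22.3\,\text{min},\\
\text{DTW: } & 100 \times 384.63\,\text{s} = 38463\,\text{s} \approx 641\,\text{min} \approx 10.7\,\text{h}.
\end{align*}
```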
Real-data Analysis
Step Count Data
We applied the proposed clustering algorithm to step count data obtained from a Fitbit, a wearable device. The step data from 79 participants were measured every minute, and the number of recorded days varies from 32 to 364 per person. The total number of days in the dataset is 21,394. We first clustered the days based on patterns without considering inter- or intra-subject variability. The scaling parameter is set to 0.2 for all cases, and the number of cluster groups is set to $K = 6$, which is determined by the gap statistic. Figure 10 illustrates the clustering results with the thicknesses and 100, and Table 5 lists the cluster size, mean step counts, and percentage of weekend days. Cluster groups are numbered in descending order of cluster size. For example, in the left panel of Fig. 10, Group 1 (red line) has the largest number of days, and Group 6 (pink line) has the smallest.
Fig. 10.
Mean time series of step data for each cluster by the proposed method with and 100
Table 5.
Summary of clustering results obtained from the proposed method when and 100
| | | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|---|
| First thickness | Number of days | 5865 | 4827 | 3471 | 3092 | 2710 | 1429 |
| | Mean step count | 10817 | 10257 | 7283 | 2701 | 7441 | 11084 |
| | Weekend (%) | 10.6 | 31.7 | 45.8 | 38.3 | 30.7 | 11.5 |
| Second thickness (100) | Number of days | 5380 | 5125 | 3704 | 2672 | 2543 | 1970 |
| | Mean step count | 10124 | 10989 | 8416 | 7948 | 6264 | 1758 |
| | Weekend (%) | 22.3 | 9.0 | 44.8 | 23.8 | 47.0 | 39.2 |
From the mean curves shown in the left panel of Fig. 10, it is noticeable that the proposed method separates days by both the pattern and the amount of activity. Group 4 contains the least active days, whereas Group 6 contains the most active days. Activity starts earlier in Groups 1 and 6 than in the other groups. In addition, in Table 5, we observe that these two groups contain a higher share of weekdays than the other groups. With the other thickness, the mean time series in the right panel of Fig. 10 indicate that the proposed method again classifies the days sensibly based on physical activity, but the average pattern in each group differs from that in the left panel. For example, Group 3 contains days with activities that continued until midnight, and the days with this pattern are not grouped together in the left panel.
For comparison, the funFEM and funHDDC methods are applied to the step data. Figure 11 displays the mean time series of each group, which differ from those of the proposed method. The DTW clustering method is excluded because computing the DTW distance between 21,394 time series took too long.
Fig. 11.
Mean time series of step data for each cluster by FunFEM and FunHDDC
The main difference between the clustering results of the proposed method and those of the functional clustering methods is that the average number of steps in the least active group under the functional clustering methods is close to zero at all times, whereas the average time series of the least active group under the proposed method is far from zero. The proposed method uses the upper boundary of e-TPT; thus, time series whose values are mostly zero and those whose values are all below five are likely to be classified together. Depending on the purpose of the study, it may be essential to classify less-active days into one group, and the choice between the proposed method and the functional methods can be made accordingly.
We computed two clustering validation measures for the numerical validation of the clustering results: the Dunn index (Dunn, 1974) and the variation of information (VI) (Meilă, 2007). The Dunn index measures the compactness within clusters and the separation between clusters, and VI measures the distance between clusterings based on entropy. For both measures, a higher index indicates better clustering. In Table 6, the proposed method provides the highest values of both measures: the highest Dunn index with one thickness and the highest VI with the other.
Table 6.
Clustering validation measures
| | | Dunn index | VI |
|---|---|---|---|
| Proposed method | | 0.031 | 1.708 |
| | | 0.023 | 1.726 |
| FunFEM | | 0.018 | 0.877 |
| FunHDDC | | 0.029 | 1.576 |
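For reference, one common form of the Dunn index divides the smallest distance between points in different clusters by the largest distance between points in the same cluster. The Python sketch below uses that form with the Euclidean distance as a default; the distance and variant used to produce Table 6 are not stated in this section, so both are assumptions, and the function name is hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Sequence


def dunn_index(
    points: Sequence[Sequence[float]],
    labels: Sequence[int],
    dist: Callable[[Sequence[float], Sequence[float]], float] = math.dist,
) -> float:
    """Minimum between-cluster distance over maximum within-cluster diameter.

    Requires at least two clusters and at least one cluster with two or
    more distinct points; otherwise the ratio is undefined.
    """
    pairs = list(combinations(zip(points, labels), 2))
    separation = min(dist(p, q) for (p, lp), (q, lq) in pairs if lp != lq)
    diameter = max(dist(p, q) for (p, lp), (q, lq) in pairs if lp == lq)
    return separation / diameter


points = [[0.0], [1.0], [10.0], [12.0]]
labels = [0, 0, 1, 1]
value = dunn_index(points, labels)   # separation 9, diameter 2
```

Because it relies on a single minimum and a single maximum, this index is sensitive to outliers, which is one reason it is usually reported alongside a second validation measure.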
The proposed method can also be applied to cluster step data for a particular individual. For this purpose, we selected the 67th individual with 364 recorded days. To observe this individual’s activity patterns, we summarized their steps in Fig. 12, presenting the mean time series of the step data on weekdays and weekends and the data from the least and the most active days. We observe that, on weekends, activities continue until midnight compared to weekdays, and the activity level varies from day to day.
Fig. 12.
(Top) The average step data on weekdays and weekends of the 67th individual, and (Bottom) the step data from the least active day and the most active day
The results of the proposed method are provided in Fig. 13. The average time series from the four groups represent various levels and patterns of activity. However, the results differ slightly depending on the pen thickness. Days with early activity are clustered together as Group 2 (moderate activity) with one thickness but are not placed in one group with the other. Moreover, e-TPT focuses on different aspects of the time series depending on the pen thickness. For example, consider the day plotted in Fig. 14. With a thickness of 20, midnight activities are more evident, and the day is clustered as Group 2 in Fig. 13(a). However, with a thickness of 100, the midnight activities do not seem much different from those in the morning, and the pen area is relatively thick in the mornings although there are few activities. The e-TPT is sensitive to high activity intensity when using a thicker pen. Therefore, with a thickness of 100, the time series is clustered into the most active group, Group 4 in Fig. 13(b).
Fig. 13.
Clustering result of 67th participant by the proposed method: (a) Thickness 20 – Mean time series of step data for each cluster; (b) Thickness 100 – Mean time series of step data for each cluster
Fig. 14.
TPT when thickness is (left) 20 and (right) 100 for a selected day. The black solid line indicates the original step data
Newly Confirmed COVID-19 Case Data
We considered the number of new COVID-19 cases per day in Seoul, Korea, from February 5, 2020, to June 18, 2021, as a time series of length 500. There are 25 districts in Seoul, as depicted in Fig. 15, and the total number of newly confirmed cases during this period is summarized in Table 7. The proportions of days with zero confirmed cases during the given period are also listed in the table and are higher than 26% in all districts. In Jongno-gu, there are zero confirmed cases on more than half of the days, and the highest single-day number of confirmed cases, 81, is observed in Gangseo-gu.
Fig. 15.
Districts of Seoul, Korea
Table 7.
The total number of newly confirmed COVID-19 cases, the number of days with zero confirmed cases, and the maximum confirmed cases per day from February 5, 2020, to June 18, 2021, according to district in Seoul, Korea
| District | Total number of cases | Number of zero days (%) | Maximum confirmed cases in a day |
|---|---|---|---|
| Jongno-gu | 791 | 258 (51.6%) | 33 |
| Jung-gu | 741 | 250 (50.0%) | 11 |
| Yongsan-gu | 1294 | 197 (39.4%) | 28 |
| Seongdong-gu | 1296 | 201 (40.2%) | 21 |
| Gwangjin-gu | 1570 | 218 (43.6%) | 25 |
| Dongdaemun-gu | 1766 | 200 (40.0%) | 32 |
| Jungnang-gu | 2091 | 192 (38.4%) | 36 |
| Seongbuk-gu | 1957 | 183 (36.6%) | 39 |
| Gangbuk-gu | 1370 | 226 (45.2%) | 19 |
| Dobong-gu | 1467 | 187 (37.4%) | 22 |
| Nowon-gu | 2175 | 175 (35.0%) | 32 |
| Eunpyeong-gu | 2031 | 173 (34.6%) | 31 |
| Seodaemun-gu | 1194 | 210 (42.0%) | 15 |
| Mapo-gu | 1508 | 189 (37.8%) | 31 |
| Yangcheon-gu | 1641 | 202 (40.4%) | 33 |
| Gangseo-gu | 2273 | 158 (31.6%) | 81 |
| Guro-gu | 1565 | 192 (38.4%) | 40 |
| Geumcheon-gu | 794 | 249 (49.8%) | 14 |
| Yeongdeungpo-gu | 1756 | 180 (36.0%) | 36 |
| Dongjak-gu | 1958 | 164 (32.8%) | 36 |
| Gwanak-gu | 2197 | 131 (26.2%) | 27 |
| Seocho-gu | 2058 | 159 (31.8%) | 37 |
| Gangnam-gu | 2852 | 144 (28.8%) | 38 |
| Songpa-gu | 2854 | 150 (30.0%) | 29 |
| Gangdong-gu | 1900 | 183 (36.6%) | 21 |
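The columns of Table 7 are simple summaries of each district's daily count series. As a minimal sketch, the helper below computes them for one series; the function name `summarize_counts` and the toy series are hypothetical, since the raw daily counts are not reproduced here. For example, 258 zero days out of 500 gives the 51.6% reported for Jongno-gu.

```python
def summarize_counts(daily_cases):
    """Summaries of one daily count series, matching the columns of Table 7."""
    zero_days = sum(1 for c in daily_cases if c == 0)
    return {
        "total": sum(daily_cases),
        "zero_days": zero_days,
        "zero_pct": 100.0 * zero_days / len(daily_cases),
        "max_per_day": max(daily_cases),
    }


# Hypothetical 10-day series used only to illustrate the computation.
toy_series = [0, 0, 3, 0, 1, 0, 0, 7, 0, 2]
print(summarize_counts(toy_series))
```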
As illustrated in Fig. 3, the number of new COVID-19 cases per day in each district forms a zero-inflated time series, and we apply the proposed method to cluster the 25 districts based on their time series patterns. The number of clusters is set to three, as determined by the gap statistic. Figure 16 presents the clustering results for three pen thicknesses (10, 30, and 100). The clustering results vary with the pen thickness. When the thickness is 10, only one district, Gwanak-gu, is classified as Group 1; Gwanak-gu has the fewest days with zero confirmed cases. Group 2 includes Gangnam-gu and Songpa-gu, the two districts with the highest total numbers of confirmed cases during this period. When the thickness is 100, the proposed method brings out coarser-scale features of the data, and Gangseo-gu, which has the highest single-day number of confirmed cases, forms a group by itself. Figure 17 displays the mean time series of each group, and Table 8 lists summary statistics of the clustering results for each pen thickness: the number of districts, the average number of cases, and the average percentage of zero days. The levels and patterns of the groups, and hence the summary statistics, vary with the pen thickness.
Fig. 16.
Clustering results by the proposed method when the pen thickness is 10, 30, and 100. Cluster groups are color-coded
Fig. 17.
Mean time series of COVID-19 data for each cluster by the proposed method with pen thicknesses of 10, 30, and 100.
Table 8.
Summary statistics for each cluster obtained from the proposed method with pen thicknesses of 10, 30, and 100
| Thickness | Statistic | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|---|
| 10 | Number of districts | 1 | 2 | 22 |
| | Average number of cases (per district per day) | 4.394 | 5.706 | 3.200 |
| | Average percentage of zero days (%) | 26.20 | 29.40 | 39.51 |
| 30 | Number of districts | 8 | 5 | 12 |
| | Average number of cases (per district per day) | 2.263 | 3.450 | 4.237 |
| | Average percentage of zero days (%) | 45.23 | 35.68 | 34.50 |
| 100 | Number of districts | 8 | 16 | 1 |
| | Average number of cases (per district per day) | 2.263 | 3.972 | 4.546 |
| | Average percentage of zero days (%) | 45.23 | 35.05 | 31.60 |
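The thickness-10 block of Table 8 can be checked against Table 7. The sketch below assumes that the average number of cases is taken per district and per day over the 500-day period, and that Cluster 3 consists of the 22 districts other than Gwanak-gu, Gangnam-gu, and Songpa-gu; the helper name `cluster_summary` is hypothetical.

```python
N_DAYS = 500  # length of each district's series

# (total cases, number of zero days) per district, copied from Table 7.
table7 = {
    "Jongno-gu": (791, 258), "Jung-gu": (741, 250), "Yongsan-gu": (1294, 197),
    "Seongdong-gu": (1296, 201), "Gwangjin-gu": (1570, 218),
    "Dongdaemun-gu": (1766, 200), "Jungnang-gu": (2091, 192),
    "Seongbuk-gu": (1957, 183), "Gangbuk-gu": (1370, 226),
    "Dobong-gu": (1467, 187), "Nowon-gu": (2175, 175),
    "Eunpyeong-gu": (2031, 173), "Seodaemun-gu": (1194, 210),
    "Mapo-gu": (1508, 189), "Yangcheon-gu": (1641, 202),
    "Gangseo-gu": (2273, 158), "Guro-gu": (1565, 192),
    "Geumcheon-gu": (794, 249), "Yeongdeungpo-gu": (1756, 180),
    "Dongjak-gu": (1958, 164), "Gwanak-gu": (2197, 131),
    "Seocho-gu": (2058, 159), "Gangnam-gu": (2852, 144),
    "Songpa-gu": (2854, 150), "Gangdong-gu": (1900, 183),
}


def cluster_summary(districts):
    """Return (size, mean daily cases per district, mean % of zero days)."""
    k = len(districts)
    total_cases = sum(table7[d][0] for d in districts)
    total_zero_days = sum(table7[d][1] for d in districts)
    return k, total_cases / (k * N_DAYS), 100.0 * total_zero_days / (k * N_DAYS)


# Group membership for thickness 10 as described in the text.
group1 = ["Gwanak-gu"]
group2 = ["Gangnam-gu", "Songpa-gu"]
group3 = [d for d in table7 if d not in group1 + group2]

for name, group in (("Cluster 1", group1), ("Cluster 2", group2), ("Cluster 3", group3)):
    k, avg_cases, zero_pct = cluster_summary(group)
    print(f"{name}: {k} districts, {avg_cases:.3f} cases/day, {zero_pct:.2f}% zero days")
```

Under these assumptions, the printed values agree with the thickness-10 rows of Table 8 after rounding.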
For comparison, we also apply FunFEM, FunHDDC, and DTW to the COVID-19 data. The clustering results are provided in Fig. 18. The results of FunFEM and FunHDDC are identical, and some parts are similar to those of the proposed method. The Dunn index and VI for the clustering results are presented in Table 9. One setting of the proposed method (0.534) and DTW (0.538) show the highest Dunn indices, while another setting of the proposed method yields the highest VI (1.039).
Fig. 18.
Clustering results obtained from FunFEM (top left), FunHDDC (top right), and DTW (bottom). Cluster groups are color-coded
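For readers unfamiliar with the DTW baseline, the following minimal sketch implements the classic dynamic-programming DTW distance with an absolute-difference local cost. It illustrates the distance only, not the clustering pipeline or the dtwclust settings used in the comparison; the function name `dtw_distance` and the toy series are hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two sequences (absolute-difference cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Two sparse series whose single spike is shifted by one time step.
x = [0, 0, 3, 0]
y = [0, 3, 0, 0]
pointwise = sum(abs(u - v) for u, v in zip(x, y))
print(pointwise, dtw_distance(x, y))
```

Because warping can realign the shifted spike, the DTW distance between the two toy series is zero, whereas the pointwise distance is not.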
Table 9.
Clustering validation measures
| Method | Pen thickness | Dunn index | VI |
|---|---|---|---|
| Proposed method | 10 | 0.534 | 0.443 |
| | 30 | 0.418 | 1.039 |
| | 100 | 0.469 | 0.779 |
| FunFEM | | 0.361 | 0.895 |
| FunHDDC | | 0.462 | 1.021 |
| DTW | | 0.538 | 0.784 |
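The two validation measures in Table 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the computation behind the table: the Dunn index here uses Euclidean distances between points as a stand-in for whatever dissimilarity is used in the study, and the VI (Meilă, 2007) is computed between two arbitrary label vectors with natural logarithms. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance.

    Assumes at least two clusters and at least one cluster with two points.
    """
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    between = [math.dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    within = [math.dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    return min(between) / max(within)


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A | B) + H(B | A), using natural logarithms."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n)) + math.log(p_ab / (count_b[b] / n)))
    return vi


# Toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))
print(variation_of_information(labels, [0, 1, 0, 1]))
```

A larger Dunn index indicates compact, well-separated clusters, while VI is zero when two partitions coincide and grows as they disagree.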
Concluding Remarks
In this study, we proposed a novel clustering method that can be applied to high-dimensional zero-inflated time series data. By modifying the TPT, we developed the e-TPT to improve the temporal resolution of zero-inflated time series data and introduced a similarity measure for zero-inflated time series data as an input variable for the clustering algorithm. Furthermore, an efficient iterative clustering algorithm was proposed. Finally, the effectiveness of the proposed method was demonstrated using simulation experiments and real-data analyses with step count data and newly confirmed COVID-19 case data.
Because e-TPT addresses the problem of excess zeros in zero-inflated data, the proposed method can cluster zero-inflated time series, which are commonly observed in various fields. In addition, the proposed method provides a multiscale view of the data by considering various thicknesses of the e-TPT. A thick pen clusters time series according to their global trend, whereas a thin pen yields groups that are separated by local features of the data. Furthermore, the proposed method addresses missing data issues by utilizing the TPT, which can accommodate missing data through the use of a large thickness. Similarly, e-TPT handles missing data by transforming the raw series into smoothed time series data.
However, the current algorithm can be applied only when all time series have the same length. Future studies could explore how to handle time series of varying lengths. Another issue in the proposed method is finding the optimal pen thickness. Although the CV technique has been used for thickness selection in the current study, a data-adaptive selection of the thickness may improve the clustering performance of the proposed method. This is left for future research.
Funding Information
This study is supported by the National Research Foundation of Korea (NRF) funded by the Korea government (2022R1F1A1074134; 2020R1A4A1018207; 2021R1A2C1091357, 2021R1A2B5B01001790).
Data Availability
Data available on request from the authors.
Declarations
Ethical standard
This article does not contain any studies with human participants performed by any of the authors.
Conflicts of interest
The authors declare that they have no conflict of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Bouveyron C, Côme E, Jacques J. The discriminative functional mixture model for a comparative analysis of bike sharing systems. Annals of Applied Statistics. 2015;9(4):1726–1760. doi: 10.1214/15-AOAS861.
- Dunn JC. Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics. 1974;4(1):95–104. doi: 10.1080/01969727408546059.
- Fryzlewicz P, Oh H-S. Thick pen transformation for time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(4):499–529. doi: 10.1111/j.1467-9868.2011.00773.x.
- Fryzlewicz P, Ombao H. Consistent classification of nonstationary time series using stochastic wavelet representations. Journal of the American Statistical Association. 2009;104(485):299–312. doi: 10.1198/jasa.2009.0110.
- Hartigan JA, Wong MA. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1979;28(1):100–108.
- Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218. doi: 10.1007/BF01908075.
- Leisch F. A toolbox for K-centroids cluster analysis. Computational Statistics & Data Analysis. 2006;51(2):526–544. doi: 10.1016/j.csda.2005.10.006.
- Lim HK, Li WK, Yu PLH. Zero-inflated Poisson regression mixture model. Computational Statistics & Data Analysis. 2014;71:151–158. doi: 10.1016/j.csda.2013.06.021.
- Meilă M. Comparing clusterings–an information based distance. Journal of Multivariate Analysis. 2007;98(5):873–895. doi: 10.1016/j.jmva.2006.11.013.
- Pakhira MK, Bandyopadhyay S, Maulik U. Validity index for crisp and fuzzy clusters. Pattern Recognition. 2004;37(3):487–501. doi: 10.1016/j.patcog.2003.06.005.
- Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356.
- Sarda-Espinosa A. dtwclust: Time series clustering along with optimizations for the dynamic time warping distance. R package version 5.5.12. 2023.
- Schmutz A, Jacques J, Bouveyron C, Cheze L, Martin P. Clustering multivariate functional data in group-specific functional subspaces. Computational Statistics. 2018;35(3):1101–1131. doi: 10.1007/s00180-020-00958-4.
- Shutaywi M, Kachouie NN. Silhouette analysis for performance evaluation in machine learning with applications to clustering. Entropy. 2021;23(6):759. doi: 10.3390/e23060759.
- Tawiah K, Iddrisu WA, Asampana Asosega K. Zero-inflated time series modelling of COVID-19 deaths in Ghana. Journal of Environmental and Public Health. 2021;2021:5543977. doi: 10.1155/2021/5543977.
- Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001;63:411–423. doi: 10.1111/1467-9868.00293.
- Wang W, Lyu G, Shi Y, Liang X. Time series clustering based on dynamic time warping. In: 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS). IEEE; 2018. p. 487–490.
- Yau KK, Wang K, Lee AH. Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical Journal. 2003;45(4):437–452. doi: 10.1002/bimj.200390024.
- Zhang X, Guo B, Yi N. Zero-inflated Gaussian mixed models for analyzing longitudinal microbiome data. PLoS One. 2020;15(11):e0242073. doi: 10.1371/journal.pone.0242073.