2023 Jun 12:1–25. Online ahead of print. doi: 10.1007/s00357-023-09437-z

Zero-Inflated Time Series Clustering Via Ensemble Thick-Pen Transform

Minji Kim 1, Hee-Seok Oh 2, Yaeji Lim 3
PMCID: PMC10258486  PMID: 37359508

Abstract

This study develops a new clustering method for high-dimensional zero-inflated time series data. The proposed method is based on thick-pen transform (TPT), in which the basic idea is to draw along the data with a pen of a given thickness. Since TPT is a multi-scale visualization technique, it provides some information on the temporal tendency of neighborhood values. We introduce a modified TPT, termed ‘ensemble TPT (e-TPT)’, to enhance the temporal resolution of zero-inflated time series data that is crucial for clustering them efficiently. Furthermore, this study defines a modified similarity measure for zero-inflated time series data considering e-TPT and proposes an efficient iterative clustering algorithm suitable for the proposed measure. Finally, the effectiveness of the proposed method is demonstrated by simulation experiments and two real datasets: step count data and newly confirmed COVID-19 case data.

Keywords: Clustering, Multiscale method, Newly confirmed COVID-19 case data, Step count data, Thick-pen transform, Zero-inflated time series data

Introduction

Clustering is a popular unsupervised machine learning technique for identifying patterns and groupings in data, which has been widely used in many domains, including biology, finance, and image processing. However, many real-world datasets, especially those in healthcare, finance, and environmental monitoring, often exhibit zero inflation, which refers to excessive zeros in the data. This characteristic poses significant challenges to traditional clustering algorithms, which assume that the data points follow a specific distribution or pattern. In particular, zero-inflated time series data are prevalent in many domains, such as disease surveillance and financial transaction analyses. For instance, in epidemiology, the counts of infectious diseases are often zero-inflated due to under-reporting, misclassification, and other factors. Various methods have been proposed to address the challenges of clustering zero-inflated time series data, including a zero-inflated Gaussian mixture model (Zhang et al., 2020), a zero-inflated Poisson mixture model (Lim et al., 2014), and a zero-inflated negative binomial mixture model (Yau et al., 2003). These methods aim to model the zero-inflation by adding extra parameters or components to the mixture model and effectively cluster zero-inflated time series data.

In this study, we propose a new clustering method without specific model assumptions that can be applied to various structures of zero-inflated time series data. Selecting an appropriate distance (similarity) measure in time series data clustering is essential. Thus, we propose a similarity measure suitable for zero-inflated time series data inspired by the thick-pen transform (TPT) by Fryzlewicz & Oh (2011). The TPT is a novel way of visualizing time series data at multiple scales using a range of pens with various thicknesses. To improve the temporal resolution of zero-inflated time series data, which is crucial for efficient clustering, we introduce two modifications: the ensemble TPT (e-TPT) and a modified similarity measure called TPMA0. These approaches have the advantage of inheriting some original properties of the TPT: they capture time series trends of neighboring data points and reflect the multi-scale information of the data. Then, we present a clustering algorithm based on the proposed similarity measure.

The primary rationale of the proposed method is that e-TPT can effectively manage the issue of excessive zeros in zero-inflated time series data. To demonstrate this, we present two zero-inflated time series in Fig. 1, where the proportions of zero observations are 0.495 and 0.480. We apply e-TPT with a square pen, as explained in Section 2.1, and obtain the upper boundary of the pen. The lower boundary of the e-TPT for zero-inflated time series data rarely fluctuates; thus, we only consider the upper boundary of the pen. The upper boundary from the pen with a thickness of 100 manifests the global trend of the data from the two time series, and the two series are distinguished. Moreover, there is no zero observation in the upper boundary of the e-TPT, indicating that a simple clustering method may work well without explicitly handling the problem of excessive zeros.

Fig. 1. From left to right, two simulated zero-inflated time series, their e-TPT with a square pen with a thickness of 5, and upper boundaries obtained from the e-TPT with thicknesses of 5 and 100

This study is motivated by two real-world time series. The first comprises data on the number of steps recorded from wearable devices. Figure 2 depicts the step data recorded for three days. As expected, zero values occur frequently, and daily activity patterns are observed. A proper clustering of step data can provide rich information about physical activities and can be further used for personal healthcare services.

Fig. 2. Step count data for three different days

We consider newly confirmed coronavirus disease 2019 (COVID-19) cases per day in Seoul, Korea, as the second time series dataset. South Korea had its first confirmed COVID-19 case in January 2020. As of February 2022, the cumulative number of confirmed cases was more than 2,665,000. Figure 3 illustrates the number of new COVID-19 cases per day in three districts in Seoul from February 5, 2020, to June 18, 2021. Before November 2020, few new cases of COVID-19 were confirmed in all three districts, but the number of confirmed cases suddenly increased in the winter of 2020. The proportions of days with zero confirmed cases in the three districts are 51.6%, 31.6%, and 50%, respectively. This data analysis aims to observe the time series patterns of confirmed cases that vary from district to district and to cluster the 25 districts in Seoul based on the patterns of confirmed COVID-19 cases per day. Recently, many COVID-19-related studies have been conducted, and the number of deaths or confirmed cases is often modeled using zero-inflated time series models. For example, Tawiah et al. (2021) analyzed the trend of the daily count of COVID-19 deaths in Ghana using a zero-inflated Poisson autoregressive model and a zero-inflated negative binomial autoregressive model.

Fig. 3. Newly confirmed COVID-19 cases per day in three districts of Seoul from February 5, 2020 to June 18, 2021

The remainder of this paper is organized as follows. Section 2 introduces the e-TPT and proposes a new similarity measure based on it. In addition, the proposed clustering method and its practical algorithm are presented. In Section 3, a simulation study is conducted to evaluate the empirical performance of the proposed method. Next, Section 4 discusses the analysis of two real datasets: step count data and newly confirmed COVID-19 case data. The concluding remarks are provided in Section 5.

Proposed Clustering Procedure

Ensemble Thick-Pen Transform

The TPT is based on the idea of drawing along time series data points with a pen of a given shape and thickness. We let $J=\{\tau_j>0: j=1,\ldots,|J|\}$ denote a set of thickness parameters. The TPT of a real-valued univariate process $\{X(t)\}_{t=1}^{T}$ is defined as the following sequence of boundary pairs:

$$TP_{J}(X(t))=\{(L^{\tau_j}(X(t)),\,U^{\tau_j}(X(t)))\}_{j=1,\ldots,|J|},$$

where $L^{\tau_j}(X(t))$ and $U^{\tau_j}(X(t))$ represent the lower and upper boundaries of the area covered by a pen of thickness $\tau_j$ at time $t$, respectively. As for the pen shape, Fryzlewicz & Oh (2011) considered square and round shapes as follows.

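  1. Square pen
    $$U^{\tau}(X(t))=\max_{k\in[-\tau/2,\,\tau/2]\cap\mathbb{Z}}\{X(t+k)\}+\frac{\tau}{2}\gamma,$$
    $$L^{\tau}(X(t))=\min_{k\in[-\tau/2,\,\tau/2]\cap\mathbb{Z}}\{X(t+k)\}-\frac{\tau}{2}\gamma.$$
  2. Round pen
    $$U^{\tau}(X(t))=\max_{k\in[-\tau/2,\,\tau/2]\cap\mathbb{Z}}\Big\{X(t+k)+\gamma\sqrt{\tau^{2}/4-k^{2}}\Big\},$$
    $$L^{\tau}(X(t))=\min_{k\in[-\tau/2,\,\tau/2]\cap\mathbb{Z}}\Big\{X(t+k)-\gamma\sqrt{\tau^{2}/4-k^{2}}\Big\}.$$

Above, $\mathbb{Z}$ denotes the set of integers, and $\gamma$ represents the scaling factor defined to adjust the difference between the thickness of the pen and the data variability. As Fryzlewicz & Oh (2011) suggested, $\gamma$ is set to $\gamma=0.1$ unless otherwise stated.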
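As an illustration of the square-pen definition, the following is a minimal Python sketch, not the authors' implementation. The function name `square_pen_tpt` is hypothetical, and truncating the window at the two ends of the series is an assumption, since the text does not specify how boundaries of the series are handled.

```python
def square_pen_tpt(x, tau, gamma=0.1):
    """Square-pen TPT boundaries (lower, upper) of a sequence x.

    Assumption: windows are truncated at the ends of the series;
    the text does not say how the edges are treated.
    """
    half = int(tau // 2)        # integer offsets k with |k| <= tau / 2
    shift = tau * gamma / 2     # the (tau / 2) * gamma term
    n = len(x)
    lower, upper = [], []
    for t in range(n):
        window = x[max(0, t - half): min(n, t + half + 1)]
        upper.append(max(window) + shift)
        lower.append(min(window) - shift)
    return lower, upper


# Usage on a short zero-inflated toy series
lower, upper = square_pen_tpt([0, 0, 3, 0, 0, 1, 0], tau=2)
```

For a zero-inflated nonnegative series, nearly every window contains a zero, so the lower boundary in this sketch is almost constant at minus the shift term, which matches the remark that the lower boundary rarely fluctuates.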

The TPT has a multi-scale feature of viewing data at different distances according to the thickness of the pen. Specifically, applying large τ values corresponds to zooming out and coarsely viewing data trends, whereas small τ values sensitively capture the original features. Further, this transformation is visually intuitive and informative. Figure 4(a) and (b) display the boundaries obtained by applying a square pen with thicknesses of τ=30 and 80, respectively, to the step data recorded on a specific day. The data trend with τ=80 is coarser than that with τ=30.

Fig. 4. (a) and (b) TPT with thicknesses of 30 and 80. (c) and (d) e-TPT with thicknesses of 30 and 80. The original step data are plotted with a solid black line

In this study, we consider a variation of the TPT to obtain a smooth version of the thick-pen boundaries, enhancing the temporal resolution of the time-series data and making the proposed clustering performance more effective. Thus, we define the upper and lower ensemble boundaries of a real-valued univariate process $\{X(t)\}_{t=1}^{T}$ with a square pen with a thickness of $\tau$ as

$$U^{\tau}_{E}(X(t))=\frac{1}{\tau+1}\sum_{s=0}^{\tau}\max\{X(t-s),\ldots,X(t+\tau-s)\}+\frac{\tau}{2}\gamma,$$
$$L^{\tau}_{E}(X(t))=\frac{1}{\tau+1}\sum_{s=0}^{\tau}\min\{X(t-s),\ldots,X(t+\tau-s)\}-\frac{\tau}{2}\gamma,$$

which are the ensemble means of boundaries created with different starting points $s$. Thus, the ensemble TPT (e-TPT) of $\{X(t)\}_{t=1}^{T}$ is defined as the sequence of pairs of the ensemble boundaries,

$$TP^{E}_{J}(X(t))=\{(L^{\tau_j}_{E}(X(t)),\,U^{\tau_j}_{E}(X(t)))\}_{j=1,2,\ldots,|J|}.$$

Using the average value of the boundaries, the ensemble TPT provides smoother boundaries than the conventional TPT and is less sensitive to the initial data values and outliers. Figure 4(c) and (d) illustrate the ensemble boundaries with thicknesses of 30 and 80, where the boundaries are much smoother than the conventional ones in panels (a) and (b).
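The ensemble upper boundary can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the name `etpt_upper` is hypothetical, `tau` is assumed to be a nonnegative integer, and windows that run past either end of the series are truncated, which is an assumption not stated in the text.

```python
def etpt_upper(x, tau, gamma=0.1):
    """Upper ensemble boundary of the e-TPT with a square pen.

    Averages the running maximum over the tau + 1 windows of length
    tau + 1 that contain time t, then adds (tau / 2) * gamma.
    Assumption: windows are truncated at the ends of the series.
    """
    n = len(x)
    shift = tau * gamma / 2
    upper = []
    for t in range(n):
        total = 0.0
        for s in range(tau + 1):          # starting point offset s
            lo = max(0, t - s)
            hi = min(n, t + tau - s + 1)  # window X(t-s), ..., X(t+tau-s)
            total += max(x[lo:hi])
        upper.append(total / (tau + 1) + shift)
    return upper


# Usage: the boundary is strictly positive even where the data are zero
u = etpt_upper([0, 0, 4, 0, 0], tau=2)
```

Because every value of the boundary includes the positive shift term, the transformed series contains no zeros when the input is nonnegative and the thickness and scaling factor are positive.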

Similarity Measure for Clustering

This section proposes a similarity measure employed as the input variable for clustering zero-inflated time-series data. For this purpose, we consider the thick-pen measure of association (TPMA) between the two time series X(t) and Y(t) proposed by Fryzlewicz & Oh (2011). Suppose that X(t) and Y(t) are on approximately the same scale. The TPMA is then defined as

$$\rho^{\tau}(X(t),Y(t))=\frac{\min\{U^{\tau}(X(t)),U^{\tau}(Y(t))\}-\max\{L^{\tau}(X(t)),L^{\tau}(Y(t))\}}{\max\{U^{\tau}(X(t)),U^{\tau}(Y(t))\}-\min\{L^{\tau}(X(t)),L^{\tau}(Y(t))\}},\quad\text{for } t=1,\ldots,T. \tag{1}$$

Moreover, $\rho^{\tau}(X(t),Y(t))\in(-1,1]$, and $\rho^{\tau}(X(t),Y(t))>0$ holds when an overlap exists between the two boundaries, whereas $\rho^{\tau}(X(t),Y(t))<0$ when a gap exists between the two boundaries. This idea of measuring time series dependence based on the overlap or gap of pen areas is intuitively recognized through the visualization of transformations.

To reflect the characteristics of zero-inflated time series data, we propose a new similarity measure based on e-TPT and TPMA. From now on, we assume that the given time series data are nonnegative and zero-inflated. Then, the lower boundary of e-TPT for zero-inflated time series data rarely fluctuates. Therefore, it is natural to modify the TPMA measure of (1) to set the lower boundary of the pen to zero. Then, the modified TPMA measure, TPMA0, is defined as

$$\eta^{\tau}_{E}(X(t),Y(t)):=\frac{\min\{U^{\tau}_{E}(X(t)),U^{\tau}_{E}(Y(t))\}}{\max\{U^{\tau}_{E}(X(t)),U^{\tau}_{E}(Y(t))\}}, \tag{2}$$

for each time $t$, and its geometric mean over time is proposed as a measure of the similarity between two time series. Here, $\eta^{\tau}_{E}(X(t),Y(t))$ measures the length of the intersection of $[0,U^{\tau}_{E}(X(t))]$ and $[0,U^{\tau}_{E}(Y(t))]$ as a proportion of the length of the union of these two intervals. Thus, $0<\eta^{\tau}_{E}(X(t),Y(t))\le 1$ holds for $\tau>0$. This measure returns a value close to 1 when the two time series are similar at time $t$. Note that the pen thickness affects the ratio through the e-TPT. For example, the ratio is less affected when the pen is relatively thin, but the ratio can vary significantly when the pen is relatively thick compared to the data values.
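The pointwise ratio in (2) and its geometric mean over time can be illustrated with a short Python sketch. The function names are hypothetical, and the inputs are assumed to be upper ensemble boundaries that have already been computed and are strictly positive.

```python
import math


def tpma0(upper_x, upper_y):
    """Pointwise TPMA0: min over max of two positive upper boundaries."""
    return [min(a, b) / max(a, b) for a, b in zip(upper_x, upper_y)]


def geometric_mean_similarity(upper_x, upper_y):
    """Geometric mean of the pointwise TPMA0 values over time."""
    eta = tpma0(upper_x, upper_y)
    return math.exp(sum(math.log(e) for e in eta) / len(eta))


# Usage with two toy upper boundaries
eta = tpma0([1.0, 2.0, 4.0], [2.0, 2.0, 1.0])
sim = geometric_mean_similarity([1.0, 2.0, 4.0], [2.0, 2.0, 1.0])
```

Computing the geometric mean through logarithms avoids forming a long product of values in (0, 1], which could underflow for large T.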

Figure 5 presents the procedure for computing TPMA0 for two step count time series. Panels (a) and (b) display the e-TPT results of the two datasets using a square pen with a thickness of 30, and panel (c) reveals the overlapping areas (purple) of the two e-TPT results. Finally, the result of the similarity measure $\eta^{\tau}_{E}(X(t),Y(t))$ of (2) is presented in panel (d). The measure is low when little overlap occurs between the two e-TPTs, whereas it is close to 1 when a considerable overlap exists. For comparison, we also present the TPMA0 values based on TPT with a square pen in panel (e) and TPMA values based on e-TPT in panel (f). Both TPMA0 and TPMA reflect the similarity between the two time series well. However, using (2), we can obtain more straightforward criteria for clustering, which is discussed in Section 2.3.

Fig. 5. (a) and (b) e-TPT results using the square pen with a thickness of 30 for two step data. (c) Overlapping areas (purple) of the two e-TPT results. (d) TPMA0 values based on e-TPTs, (e) TPMA0 values based on TPTs. (f) TPMA values based on e-TPTs

Clustering Procedure Based on TPMA0

The goal is to determine K optimal partitions of a set of observations $\mathcal{X}=\{X_1,\ldots,X_N\}$, where each $X_i$ belongs to a domain set $E$. We let $P=\{P_1,\ldots,P_K\}$ be a set of K partitions of the data that satisfies $\bigcup_{c=1}^{K}P_c=\mathcal{X}$ and $P_i\cap P_j=\emptyset$ for $i\neq j$. We set $M=\{m_1,\ldots,m_K: m_c\in E,\ c=1,\ldots,K\}$ as a set of cluster prototypes.

Given a distance function $d$, we define the clustering problem as minimizing the following cost function,

$$W(P,M):=\sum_{c=1}^{K}\sum_{X\in P_c}d(X,m_c). \tag{3}$$

This optimization process is carried out in two steps using an iterative algorithm:

  1. Update P: Given a set of cluster prototypes $M$, update $P$ with
    $$P_c=\{X_i:\ \arg\min_{m\in M}d(X_i,m)=m_c,\ i=1,\ldots,N\}\quad\text{for each } c\in\{1,\ldots,K\}.$$
  2. Update M: Given a partition $P$, update $M$ with
    $$m_c=\arg\min_{m\in E}\sum_{X\in P_c}d(X,m)\quad\text{for each } c\in\{1,\ldots,K\}.$$

The cost function does not increase at any iteration step, because each update minimizes (3) with the other argument held fixed. The well-known K-means algorithm (Hartigan & Wong, 1979) deals with the $L_2$ distance, leading to the mean of each component as a cluster prototype when $E=\mathbb{R}^{n}$, $n\in\mathbb{N}$. Furthermore, the $L_1$ distance function yields the K-medians algorithm, which uses the medians as cluster prototypes (Leisch, 2006).

Suppose that we have multiple zero-inflated time series $X_i(t)$, $i=1,\ldots,N$. We obtain the corresponding upper boundaries of $X_i(t)$ by e-TPT using a square pen with a thickness of $\tau$, $U^{\tau}_{E}(X_i(t))$, $i=1,\ldots,N$. We compute the similarity measure TPMA0 of (2) between any two time series $X_i(t)$ and $X_j(t)$ ($i\neq j$) and take the logarithm. The measure can be further expressed as

$$\log(\eta^{\tau}_{E}(X_i(t),X_j(t)))=\log\frac{\min\{U^{\tau}_{E}(X_i(t)),U^{\tau}_{E}(X_j(t))\}}{\max\{U^{\tau}_{E}(X_i(t)),U^{\tau}_{E}(X_j(t))\}}=-\left|\log\frac{U^{\tau}_{E}(X_i(t))}{U^{\tau}_{E}(X_j(t))}\right|.$$
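The last equality can be checked case by case. The short derivation below uses the shorthand $a=U^{\tau}_{E}(X_i(t))$ and $b=U^{\tau}_{E}(X_j(t))$, which is introduced here only for illustration, and assumes $a,b>0$.

```latex
\begin{align*}
a \le b:&\quad \log\frac{\min\{a,b\}}{\max\{a,b\}}
          = \log a - \log b
          = -(\log b - \log a)
          = -\lvert \log a - \log b \rvert, \\
a > b:&\quad \log\frac{\min\{a,b\}}{\max\{a,b\}}
          = \log b - \log a
          = -\lvert \log a - \log b \rvert.
\end{align*}
% Numerical check: a = 2, b = 8 gives eta = 2/8 = 0.25,
% and log(0.25) = -log(4), approximately -1.386.
```

In both cases the log-similarity is the negative absolute difference of the log upper boundaries, which is why maximizing the similarity corresponds to an $L_1$ problem on the log scale.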

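Given a partition $\{P_1,\ldots,P_K\}$, we let $c_i\in\{1,\ldots,K\}$ be the cluster label of $X_i(t)$, and $m_{c_i}(t)$ be the cluster prototype of the group to which $X_i(t)$ belongs. Then, we maximize the geometric mean of the proposed similarity measure over time $t$ and elements $i$, which is equivalent to minimizing the sum of $L_1$ distances between the logarithms of the upper boundaries,

$$\begin{aligned}
&\underset{P,M}{\text{maximize}}\ \prod_{i=1}^{N}\Big\{\prod_{t=1}^{T}\eta^{\tau}_{E}(X_i(t),m_{c_i}(t))\Big\}^{1/T}\\
\Leftrightarrow\ &\underset{P,M}{\text{maximize}}\ \sum_{i=1}^{N}\sum_{t=1}^{T}\frac{1}{T}\log(\eta^{\tau}_{E}(X_i(t),m_{c_i}(t)))\\
\Leftrightarrow\ &\underset{P,M}{\text{minimize}}\ \sum_{t=1}^{T}\sum_{i=1}^{N}\left|\log U^{\tau}_{E}(X_i(t))-\log U^{\tau}_{E}(m_{c_i}(t))\right|.
\end{aligned}$$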

In other words, given a partition and the cluster prototypes, we have the following cost function to be minimized,

$$W(P,M)=\sum_{t=1}^{T}\sum_{i=1}^{N}\left|\log U^{\tau}_{E}(X_i(t))-\log U^{\tau}_{E}(m_{c_i}(t))\right|. \tag{4}$$

This problem is an $L_1$ optimization for the logarithms of the upper boundaries $\{\log U^{\tau}_{E}(X_i(t)),\ i=1,\ldots,N\}$. Thus, applying the K-medians algorithm to this set ensures a monotonic decrease in the cost function.
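The two-step update applied to the cost function (4) can be sketched as a plain K-medians loop on the log upper boundaries. This is a minimal illustration under stated assumptions, not Algorithm 1 itself: the function names are hypothetical, the initial prototypes are taken from user-supplied series indices `init_idx`, and an empty cluster simply keeps its previous prototype.

```python
import math
import statistics


def l1(a, b):
    """L1 distance between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def kmedians_log_upper(uppers, init_idx, max_iter=50):
    """K-medians on the logarithms of positive upper boundaries.

    uppers   : list of N sequences of length T (all values > 0)
    init_idx : indices of the series used as initial prototypes (assumption)
    Returns (labels, prototypes on the log scale, final cost).
    """
    logs = [[math.log(u) for u in row] for row in uppers]
    protos = [list(logs[i]) for i in init_idx]
    labels = None
    for _ in range(max_iter):
        # Update P: assign each series to its nearest prototype in L1
        new_labels = [
            min(range(len(protos)), key=lambda c: l1(row, protos[c]))
            for row in logs
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update M: pointwise median of the members of each cluster
        for c in range(len(protos)):
            members = [row for row, lab in zip(logs, labels) if lab == c]
            if members:  # an empty cluster keeps its old prototype
                protos[c] = [statistics.median(col) for col in zip(*members)]
    cost = sum(l1(row, protos[lab]) for row, lab in zip(logs, labels))
    return labels, protos, cost


# Usage on four toy upper boundaries forming two obvious groups
labels, protos, cost = kmedians_log_upper(
    [[1, 1, 1], [1.1, 1, 0.9], [10, 10, 10], [9, 11, 10]], init_idx=[0, 2]
)
```

The pointwise median is used in the prototype update because it minimizes the sum of absolute deviations at each time point, which is exactly the inner minimization of (4) for a fixed partition.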

As $0<\eta^{\tau}_{E}(X_i(t),X_j(t))\le 1$ holds for $\tau>0$, its log-transformation is problematic as $\eta^{\tau}_{E}(X_i,X_j)$ approaches zero. However, the thickness of a pen guarantees a minimal value of the upper boundaries that is sufficiently greater than zero. Thus, we assume that $\delta$ exists such that $\eta^{\tau}_{E}(X_i(t),X_j(t))>\delta>0$ for any $i,j\in\{1,\ldots,N\}$, as long as the upper boundaries of the transformed data are bounded above.

Practical Algorithm

The entire clustering scheme can be summarized by Algorithm 1. Suppose that we have $N$ zero-inflated nonnegative time series, $X_1,\ldots,X_N$. We assume that all time series have the same scale, and that the number of cluster groups $K$ and the thickness $\tau$ are given.

[Algorithm 1: shown as a graphic in the original]

The following are some remarks on the algorithm.

  • Cost function (4) can be viewed as an $L_1$ optimization problem for the set of logarithms of upper boundaries $\{\log U^{\tau}_{E}(X_i(t)),\ i=1,\ldots,N\}$. Therefore, we apply the K-medians algorithm to this set and select the cluster prototype as the median of the logarithmic values, as in Step 4. It is worth noting that the corresponding $m_c(t)$, the cluster prototype of $X\in P_c$ in (3), can be defined as a value satisfying $\mu_c(t)=\log U^{\tau}_{E}(m_c(t))$, which is not unique for each $\mu_c(t)$. However, the clustering algorithm works only with $\mu_c(t)$ and does not require identifying $m_c(t)$.

  • This study considers various thickness values ($\tau$) for a multiscale interpretation of the results. Applying a thick pen tends to view data from a distance, focusing on significant trends; thus, the proposed clustering method divides the data based on global trends. In contrast, using a small thickness ($\tau$) tends to capture the pattern sensitively, and the corresponding clustering results reflect the detailed data pattern. However, in some cases, the optimal thickness ($\tau$) must be determined to obtain a single clustering result, where the cross-validation (CV) technique can be used to select the optimal value. More specifically, Algorithm 1 is applied to training data, and the cluster prototypes $\mu_c(t)$, $c\in\{1,2,\ldots,K\}$, are obtained. Then, the cluster group label $c_i$ of test data $X_i^{te}$ is determined as
    $$c_i=\arg\min_{c}\sum_{t=1}^{T}\left|\log U^{\tau}_{E}(X_i^{te}(t))-\mu_c(t)\right|.$$
    The cross-validated error is defined as
    $$CV=\frac{1}{n_{te}}\sum_{i=1}^{n_{te}}I(c_i\neq c_{i,\text{true}}),$$
    where $n_{te}$ is the number of time series in the test data set, $c_{i,\text{true}}$ represents the true cluster group label of $X_i^{te}$, and $I$ denotes the indicator function. A cluster validity index, such as the Dunn index (Pakhira et al., 2004) or Silhouette index (Shutaywi & Kachouie, 2021), may be used if the actual cluster groups are unknown.
  • To determine the number of clusters K, we use the gap statistics from Tibshirani et al. (2001).
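The cross-validation step in the second remark can be sketched as follows. This is a hedged illustration with hypothetical function names. It assumes that the upper ensemble boundary of each test series and the log-scale prototypes from the training data are already available, and that predicted and true labels use the same numbering.

```python
import math


def assign_label(upper_test, prototypes):
    """Label of a test series: nearest log-scale prototype in L1 distance.

    upper_test : positive upper boundary of one test series (length T)
    prototypes : list of K log-scale prototypes mu_c(t), each of length T
    """
    log_u = [math.log(u) for u in upper_test]
    dists = [sum(abs(a - m) for a, m in zip(log_u, mu)) for mu in prototypes]
    return dists.index(min(dists))


def cv_error(predicted, truth):
    """Proportion of test series whose predicted label differs from the truth."""
    return sum(p != q for p, q in zip(predicted, truth)) / len(truth)


# Usage with two toy prototypes on the log scale
prototypes = [[0.0, 0.0], [2.0, 2.0]]
label = assign_label([8.0, 7.0], prototypes)
error = cv_error([0, 1, 1, 0], [0, 1, 0, 0])
```

In a thickness search, this error would be computed for each candidate thickness and averaged over folds, and the thickness with the smallest average error would be retained.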

Simulation Study

This section conducts a simulation study to evaluate the empirical performance of the proposed method. For this purpose, we consider four types of zero-inflated time series data. The true number of clusters, K, is assumed to be known in all cases. The reproducible R code for simulation studies is provided at https://github.com/mjkim1001/ZITS.

Models for Simulation Data

Model 1: Nonstationary autoregressive model with abruptly changing parameters

This model was first considered by Fryzlewicz & Ombao (2009) for a classification problem. We modified the model slightly to have a zero-inflated time series structure and use it for clustering. The $i$th time series from group $g$, denoted as $X_i^{(g)}(t)$, is generated from

$$X_i^{(g)}(t)=\begin{cases}Y_i^{(g)}(t), & \text{if } Y_i^{(g)}(t)\ge 0,\\ 0, & \text{otherwise},\end{cases}\quad\text{for } i=1,\ldots,N;\ t=1,\ldots,T, \tag{5}$$

where $Y_i^{(g)}(t)=\phi_1^{(g)}Y_i^{(g)}(t-1)+\phi_2^{(g)}Y_i^{(g)}(t-2)+\epsilon_i^{(g)}(t)$, and $\epsilon_i^{(g)}(t)\overset{\text{i.i.d.}}{\sim}N(0,1)$. The time-varying parameters $\phi_1^{(g)}$ and $\phi_2^{(g)}$ are defined as in Table 1, where $\phi_1^{(g)}$ differs between the two groups for $t=54,\ldots,128$. We generated $N=100$ time series from each group, and two sample time series with $T=500$ from each group are presented in Fig. 6. The average zero ratio of the 100 time series is 0.501.

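Table 1. Time-varying parameters in Model 1

| Time-varying parameter | Time index | Group g=1 | Group g=2 |
|---|---|---|---|
| $\phi_1^{(g)}$ | $t=1,\ldots,53$ | 0.8 | 0.8 |
| | $t=54,\ldots,128$ | -0.9 | 1.6 |
| | $t=129,\ldots,T$ | 0.8 | 0.8 |
| $\phi_2^{(g)}$ | $t=1,\ldots,T$ | -0.81 | -0.81 |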

Fig. 6. Sample time series with T=500 generated from Model 1. The red vertical lines indicate the change points, t=54 and 128
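A data-generating sketch for Model 1 is given below as an illustration of (5) and Table 1. It is not the authors' R code from the repository: the function names are hypothetical, and starting the recursion from zero initial values is an assumption, since the text does not describe the initialization.

```python
import random


def phi1(t, group):
    """Time-varying AR coefficient phi_1 from Table 1 (t is 1-based)."""
    if 54 <= t <= 128:
        return -0.9 if group == 1 else 1.6
    return 0.8


def simulate_model1(group, T=500, seed=None):
    """One zero-inflated series from Model 1.

    Assumption: the latent AR(2) process starts from Y(0) = Y(-1) = 0.
    """
    rng = random.Random(seed)
    y_prev1 = y_prev2 = 0.0
    x = []
    for t in range(1, T + 1):
        y = phi1(t, group) * y_prev1 - 0.81 * y_prev2 + rng.gauss(0.0, 1.0)
        x.append(y if y >= 0 else 0.0)   # negative values are set to zero
        y_prev2, y_prev1 = y_prev1, y
    return x


# Usage: one series per group with the same random seed
x1 = simulate_model1(group=1, seed=1)
x2 = simulate_model1(group=2, seed=1)
```

With a common seed, the two groups share the same innovations, so the generated series coincide up to the first change point and differ only through the coefficient in the middle segment.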

Model 2 : Nonstationary AR model with slowly changing parameters

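We generated two cases of data from a nonstationary AR model with slowly changing parameters. We used (5) with the following forms of $Y_i^{(g)}(t)$, for $i=1,\ldots,N(=100)$; $t=1,\ldots,T(=500)$, and $\epsilon_i^{(g)}(t)\overset{\text{i.i.d.}}{\sim}N(0,1)$ ($g=1,2$).

Case (a)

$$\begin{aligned}
Y_i^{(1)}(t)&=-0.8[1-0.7\cos(\pi t/T)]\,Y_i^{(1)}(t-1)-0.81\,Y_i^{(1)}(t-2)+\epsilon_i^{(1)}(t),\\
Y_i^{(2)}(t)&=-0.8[1-0.001\cos(\pi t/T)]\,Y_i^{(2)}(t-1)-0.81\,Y_i^{(2)}(t-2)+\epsilon_i^{(2)}(t).
\end{aligned}$$

Case (b)

$$\begin{aligned}
Y_i^{(1)}(t)&=-0.8[1-0.7\cos(\pi t/T)]\,Y_i^{(1)}(t-1)-0.81\,Y_i^{(1)}(t-2)+\epsilon_i^{(1)}(t),\\
Y_i^{(2)}(t)&=-0.8[1-0.1\cos(\pi t/T)]\,Y_i^{(2)}(t-1)-0.81\,Y_i^{(2)}(t-2)+\epsilon_i^{(2)}(t).
\end{aligned}$$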

The average zero ratios for both cases are 0.5. The sample time series data from each group for both cases are illustrated in Fig. 7.

Fig. 7. Sample time series generated from Model 2 with T=500

Model 3 : Block data with different patterns

We considered a noisy block time series with four different patterns. To generate the time series, we reused (5) with the following $Y_i^{(g)}(t)$, $g\in\{1,2,3,4\}$:

$$Y_i^{(g)}(t)=\sum_{j=1}^{5}h_j\{1+\mathrm{sign}((t-1)/T-\xi_j^{(g)})\}/2+\epsilon_i^{(g)}(t),\quad\text{for } i=1,\ldots,N(=100),\ t=1,\ldots,T(=500),$$

where $\epsilon_i^{(g)}(t)\sim N(0,3^{2})$, $g=1,2,3,4$. In addition, $h_j$ satisfies $|h_j|\sim U(0,20)$, $h_1,h_3<0$, $h_2,h_4>0$, and $\sum_{j=1}^{5}h_j=0$, whose values are related to the height of each vertical jump, and $\xi_j^{(g)}$ is generated from $U\big(\frac{g-1}{5},\frac{g+1}{5}\big)$, for $g=1,2,3,4$. The average zero ratio of the $N(=100)$ series from the above model is 0.494. The sample block time series data from each group are presented in Fig. 8.

Fig. 8. Sample time series generated from Model 3 with T=500

Model 4 : ZIP model with different mean

We considered a time series $X_i^{(g)}(t)$, $t=1,\ldots,T(=500)$, and $i=1,\ldots,N(=100)$, from group $g\in\{1,2\}$. The $i$th series in group $g$ is generated from a zero-inflated Poisson model,

$$X_i^{(g)}(t)\sim\mathrm{ZIP}(\lambda_{ig},\omega_i),$$

where $\lambda_{ig}$ is the expected Poisson count generated from $N(\mu_{ig},\sigma^{2})$, $g=1,2$, with $\mu_{i1}\sim\mathrm{Unif}(3,4)$ and $\mu_{i2}\sim\mathrm{Unif}(2,10)$, and $\sigma=0.1$ or $0.5$. The zero-inflation parameter $\omega_i$ is generated from $\mathrm{Unif}(0.4,0.7)$. The average zero ratio from the generated data set is 0.583. Figure 9 displays the sample time series from two groups with $\sigma=0.1$.

Fig. 9. Sample time series generated from Model 4 with T=500 and σ=0.1
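A zero-inflated Poisson series of the kind used in Model 4 can be simulated with the Python standard library alone. The sketch below is illustrative and uses hypothetical function names. It assumes the common parameterization in which the zero-inflation parameter is the probability of a structural zero, takes the Poisson mean and the zero-inflation parameter as given rather than drawing them as in the text, and uses Knuth's multiplication method for the Poisson draw.

```python
import math
import random


def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_zip_series(lam, omega, T=500, seed=None):
    """Zero-inflated Poisson series of length T.

    Assumption: with probability omega the value is a structural zero;
    otherwise it is a Poisson(lam) count.
    """
    rng = random.Random(seed)
    return [0 if rng.random() < omega else poisson(lam, rng) for _ in range(T)]


# Usage: a series with Poisson mean 3.5 and zero-inflation 0.5
x = simulate_zip_series(lam=3.5, omega=0.5, T=2000, seed=0)
zero_ratio = sum(1 for v in x if v == 0) / len(x)
```

Under this parameterization the expected zero ratio is the zero-inflation probability plus the Poisson probability of zero weighted by its complement, so the observed ratio is slightly above the zero-inflation parameter.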

For comparison, we considered three existing functional and time series clustering methods:

  • FunFEM – Functional clustering based on discriminative functional mixture modeling by Bouveyron et al. (2015). We use the default criterion in the R package “funFEM”.

  • FunHDDC – Functional clustering based on the functional latent mixture modeling by Schmutz et al. (2018). We use the BIC to select the best model, and other hyper-parameters are set using default values in the R package “funHDDC”.

  • DTW – Time-series clustering based on the dynamic time warping (DTW) distance by Wang et al. (2018), which is implemented using the R package “dtwclust” by Sarda-Espinosa (2022).

Simulation Results

For the evaluation measures, we used the correct classification rate (CCR; %) and the adjusted Rand index (aRand) by Hubert & Arabie (1985). The aRand is a modified version of the Rand index (Rand, 1971), which adjusts the Rand index to have an expected value of 0 and an upper bound of 1. It measures the correspondence between two partitions by classifying the object pairs in a contingency table, and a higher value of the aRand index indicates a higher similarity between the two partitions. Table 2 summarizes the evaluation measures computed over 100 simulations.
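Both evaluation measures can be computed from two label vectors, as in the following Python sketch. The function names are hypothetical. The adjusted Rand index follows the standard contingency-table formula of Hubert & Arabie (1985). For the CCR, matching cluster labels to true labels by the best permutation is an assumption, since the text does not state how labels are aligned.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index from the contingency table of two partitions.

    Undefined (division by zero) when both partitions are trivial.
    """
    n = len(labels_a)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)


def correct_classification_rate(truth, predicted):
    """Accuracy under the best relabeling of the predicted clusters.

    Assumption: labels are matched by exhaustive permutation, which is
    practical only for a small number of clusters.
    """
    ids = sorted(set(truth) | set(predicted))
    best = 0.0
    for perm in permutations(ids):
        mapping = dict(zip(ids, perm))
        acc = sum(mapping[p] == t for p, t in zip(predicted, truth)) / len(truth)
        best = max(best, acc)
    return best


# Usage: identical partitions with swapped label names
ari = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])
ccr = correct_classification_rate([0, 0, 1, 1], [1, 1, 0, 0])
```

Both measures are invariant to how the clusters are numbered, which matters here because clustering output carries no inherent label order.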

Table 2. Means and standard deviations (in parentheses) of the correct classification rate (CCR) and adjusted Rand index (aRand) values

| Model | Measure | Proposed TPT clustering, τ=20 | Proposed TPT clustering, τ=30 | Proposed TPT clustering, τ=50 | funFEM | funHDDC | DTW |
|---|---|---|---|---|---|---|---|
| Model 1 (T=500) | CCR | 0.833(0.03) | 0.847(0.029) | 0.851(0.027) | 0.853(0.027) | 0.871(0.028) | 0.842(0.133) |
| | aRand | 0.445(0.079) | 0.483(0.079) | 0.493(0.076) | 0.463(0.077) | 0.552(0.083) | 0.535(0.264) |
| Model 1 (T=1000) | CCR | 0.812(0.064) | 0.842(0.042) | 0.849(0.026) | 0.808(0.035) | 0.824(0.036) | 0.801(0.141) |
| | aRand | 0.403(0.115) | 0.473(0.090) | 0.488(0.072) | 0.383(0.087) | 0.422(0.092) | 0.439(0.264) |
| Model 1 (T=1500) | CCR | 0.775(0.087) | 0.820(0.074) | 0.844(0.029) | 0.798(0.034) | 0.800(0.039) | 0.670(0.145) |
| | aRand | 0.329(0.150) | 0.429(0.137) | 0.473(0.081) | 0.357(0.082) | 0.364(0.092) | 0.197(0.211) |
| Model 2 (a) | CCR | 0.852(0.028) | 0.845(0.043) | 0.84(0.026) | 0.792(0.047) | 0.791(0.044) | 0.623(0.088) |
| | aRand | 0.496(0.078) | 0.480(0.091) | 0.463(0.071) | 0.347(0.110) | 0.344(0.101) | 0.089(0.102) |
| Model 2 (b) | CCR | 0.827(0.029) | 0.826(0.029) | 0.816(0.030) | 0.758(0.045) | 0.756(0.048) | 0.593(0.076) |
| | aRand | 0.427(0.075) | 0.425(0.077) | 0.400(0.076) | 0.271(0.093) | 0.268(0.098) | 0.055(0.071) |
| Model 3 | CCR | 0.901(0.090) | 0.893(0.098) | 0.905(0.063) | 0.821(0.142) | 0.895(0.063) | 0.526(0.067) |
| | aRand | 0.781(0.152) | 0.775(0.144) | 0.782(0.106) | 0.651(0.205) | 0.762(0.107) | 0.272(0.071) |
| Model 4 (σ=0.1) | CCR | 0.800(0.020) | 0.799(0.022) | 0.798(0.023) | 0.776(0.021) | 0.779(0.027) | 0.763(0.034) |
| | aRand | 0.359(0.050) | 0.357(0.053) | 0.355(0.055) | 0.303(0.046) | 0.311(0.06) | 0.279(0.068) |
| Model 4 (σ=0.5) | CCR | 0.801(0.029) | 0.801(0.029) | 0.799(0.029) | 0.782(0.03) | 0.778(0.036) | 0.764(0.038) |
| | aRand | 0.363(0.072) | 0.363(0.071) | 0.359(0.071) | 0.319(0.068) | 0.312(0.079) | 0.281(0.088) |

Bold face indicates the best performance

In Model 1, the proposed TPT clustering with τ=50 and funHDDC provide the best results. At T=500, funHDDC works best, but its performance rapidly decreases as T increases. The reduction in accuracy for large T is observed for all methods, but the proposed method with τ=50 works well even for T=1500. For Model 2, the proposed methods outperform other clustering methods for Cases (a) and (b), and the proposed method with τ=20 provides the best results. We obtain similar results for Models 3 and 4. In particular, the proposed method with τ=50 gives the best results in Model 3, and all proposed clustering results reveal similar performances in Model 4. The simulation results indicate that the proposed methods can improve accuracy compared to existing methods when an appropriate thickness is used. However, it should be noted that the performance of the proposed method relies on the choice of thickness and the underlying model, which may be difficult to determine in practical applications. Overall, the proposed method uses a multiscale strategy for pen thickness to explore clustering results at various scales and demonstrates good clustering performance when the pen thickness suits the data properties.

As described in Section 2.4, we can use a five-fold CV to determine the optimal thickness for e-TPT. Table 3 summarizes the results. In some cases, the CV-based results are worse than those of the proposed method with a specific thickness given in Table 2, but they are still better than those of funFEM, funHDDC, and DTW, except for Model 1 with T=500.

Table 3. Cross-validation results from each model

| | Model 1 (T=500) | Model 1 (T=1000) | Model 1 (T=1500) | Model 2 (a) | Model 2 (b) | Model 3 | Model 4 (σ=0.1) | Model 4 (σ=0.5) |
|---|---|---|---|---|---|---|---|---|
| Selected thickness | 41.5(20-100) | 65(30-150) | 74.22(20-150) | 31(10-100) | 37(20-100) | 96(10-150) | 42(10-150) | 40(10-100) |
| CCR | 0.853(0.027) | 0.845(0.043) | 0.837(0.053) | 0.851(0.045) | 0.828(0.038) | 0.904(0.086) | 0.804(0.021) | 0.805(0.029) |
| aRand | 0.499(0.077) | 0.481(0.086) | 0.463(0.107) | 0.498(0.094) | 0.433(0.089) | 0.792(0.132) | 0.370(0.051) | 0.372(0.072) |

Means and standard deviations (in parentheses) of the correct classification rate (CCR) and adjusted rand index (aRand) values, and means and ranges (in parentheses) of the selected thicknesses by CV

Finally, Table 4 summarizes the computation time for each method on the Model 1 dataset. For Model 1 with T=500, the proposed method took an average of 13.39 seconds to run a simulation on a desktop machine equipped with an Apple M1 Pro 8-core CPU and 16GB of memory. In the same setting, funFEM took 4.63 seconds, funHDDC took 1.12 seconds, and DTW took 384.63 seconds. The proposed method took longer than funFEM and funHDDC, but it only took around 22 minutes to run 100 simulations, which is a reasonable computation time compared to that of DTW.

Table 4. Means and standard deviations (in parentheses) of computation times (sec) for a simulation

| | Proposed TPT clustering | funFEM | funHDDC | DTW |
|---|---|---|---|---|
| Model 1 (T=500) | 13.39 (2.54) | 4.63 (1.02) | 1.12 (0.72) | 384.63 (39.41) |

Real-data Analysis

Step Count Data

We applied the proposed clustering algorithm to the step count data obtained from a Fitbit, a wearable device. The step data from 79 participants were measured every minute, and the number of recorded days varies from 32 to 364 per person. The total number of days in the dataset is 21,394. We first clustered the days based on patterns without considering inter- or intra-subject variability. The scaling parameter γ is set to 0.2 for all cases, and the number of cluster groups is set to K=6, which is determined by the gap statistic. Figure 10 illustrates the clustering results with the thicknesses τ=20 and 100, and Table 5 lists the cluster size, mean step counts, and percentage of weekend days. Cluster groups are numbered in descending order of cluster size. For example, in the left panel of Fig. 10, Group 1 (red line) has the largest number of days, and Group 6 (pink line) has the smallest number of days.

Fig. 10. Mean time series of step data for each cluster by the proposed method with τ=20 and 100

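Table 5. Summary of clustering results obtained from the proposed method when τ=20 and 100

| Thickness | Summary | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|---|---|
| τ=20 | Number of Days | 5865 | 4827 | 3471 | 3092 | 2710 | 1429 |
| | Mean Step Count | 10817 | 10257 | 7283 | 2701 | 7441 | 11084 |
| | Weekend (%) | 10.6 | 31.7 | 45.8 | 38.3 | 30.7 | 11.5 |
| τ=100 | Number of Days | 5380 | 5125 | 3704 | 2672 | 2543 | 1970 |
| | Mean Step Count | 10124 | 10989 | 8416 | 7948 | 6264 | 1758 |
| | Weekend (%) | 22.3 | 9.0 | 44.8 | 23.8 | 47.0 | 39.2 |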

From the mean curves shown in the left panel of Fig. 10, the proposed method evidently separates the days by both the pattern and the amount of activity. Group 4 contains the least active days, whereas Group 6 contains the most active days. Activity starts earlier in Groups 1 and 6 than in the other groups. In addition, in Table 5, we observe that these two groups contain a higher share of weekdays than the other groups. When τ=100, the mean time series in the right panel of Fig. 10 indicates that the proposed method properly classifies the days based on physical activity. The average pattern in each group is different from that in the left panel. For example, Group 3 contains days with activities that continued until midnight, and the days with this pattern are not grouped together at τ=20.

For comparison, the funFEM and funHDDC methods are applied to the step data. Figure 11 displays the mean time series of each group, which differ from those obtained by the proposed method. The DTW clustering method is excluded because it took too long to compute the DTW distance between 21,394 time series.

Fig. 11. Mean time series of step data for each cluster by FunFEM and FunHDDC

The main difference between the clustering results using the proposed method and functional clustering methods is that the average number of steps in the least active group using the functional clustering methods is close to zero for all times, whereas the average time series of the least active group using the proposed method is far from zero. The proposed method uses the upper bound of e-TPT; thus, it is likely that time series consisting mostly of zeros and those whose values are all less than five are classified together using the proposed method. Depending on the purpose of the study, it may be essential to classify less-active days into one group. Therefore, the proposed clustering method can be used according to the purpose of the study.

We computed two clustering validation measures for the numerical validation of the clustering results: the Dunn index (Dunn, 1974) and variation of information (VI) (Meilă, 2007). The Dunn index measures the compactness of the intra-clusters and the inter-cluster separation, and VI measures the distance between clusters based on entropy. For both measures, a higher index indicates better clustering. In Table 6, the proposed method with τ=20 and τ=100 provides the highest values for the Dunn index and VI, respectively.

Table 6.

Clustering validation measures

Method                    Dunn index   VI
Proposed method (τ=20)    0.031        1.708
Proposed method (τ=100)   0.023        1.726
FunFEM                    0.018        0.877
FunHDDC                   0.029        1.576
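
As a rough illustration of the two validation measures, the following sketch computes the Dunn index from a precomputed distance matrix and VI between two label vectors, following the usual textbook definitions. The function names are hypothetical, the natural logarithm is an assumption, and this section does not state which distance or which reference partition was used for the values in the table, so the sketch is not expected to reproduce them.

```python
import numpy as np

def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Largest pairwise distance inside any single cluster (its diameter).
    max_diam = max(dist[np.ix_(labels == c, labels == c)].max() for c in clusters)
    # Smallest pairwise distance between members of two different clusters.
    min_sep = min(
        dist[np.ix_(labels == a, labels == b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_sep / max_diam

def variation_of_information(a, b):
    """VI(A, B) = 2 H(A, B) - H(A) - H(B), with entropies in nats."""
    _, ia = np.unique(np.asarray(a), return_inverse=True)
    _, ib = np.unique(np.asarray(b), return_inverse=True)
    # Joint distribution of the two labelings (contingency table / n).
    joint = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(joint, (ia, ib), 1)
    joint /= ia.size

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    return 2 * entropy(joint.ravel()) - entropy(joint.sum(axis=1)) - entropy(joint.sum(axis=0))

# Toy example: four points on a line forming two tight, well-separated clusters.
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_labels = [0, 0, 1, 1]
toy_dunn = dunn_index(toy_dist, toy_labels)
```

In the toy example, the largest diameter is 1 and the smallest separation is 9, so the Dunn index is 9; two identical partitions have a VI of zero.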

The proposed method can also be applied to cluster step data for a particular individual. For this purpose, we selected the 67th individual, who has 364 recorded days. To observe this individual’s activity patterns, we summarized their steps in Fig. 12, presenting the mean time series of the step data on weekdays and weekends and the data from the least and the most active days. We observe that activities continue until midnight on weekends, unlike on weekdays, and that the activity level varies from day to day.

Fig. 12.

(Top) The average step data on weekdays and weekends of the 67th individual, and (Bottom) the step data from the least active day and the most active day

The results of the proposed method are provided in Fig. 13. The average time series of the four groups represent various levels and patterns of activity. However, the results differ slightly depending on the pen thickness. Days with early activity are clustered as Group 2 (moderate activity) when τ=20 but are not placed in a single group when τ=100. Moreover, e-TPT focuses on different aspects of the time series depending on the pen thickness. For example, consider the day plotted in Fig. 14. When τ=20, midnight activities are more evident, and the day is clustered as Group 2 in Fig. 13(a). However, when τ=100, the midnight activities do not appear much different from those in the morning, and the plot is relatively thick in the morning even though there are few activities. The e-TPT is sensitive to high activity intensity when a thicker pen is used. Therefore, with τ=100, the time series is clustered into the most active group, Group 4 in Fig. 13(b).
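
To make the role of the pen thickness more concrete, the sketch below implements one common form of the basic square-pen TPT of Fryzlewicz and Oh (2011), in which the upper and lower boundaries at time t are the running maximum and minimum over the next τ observations, shifted by half the pen thickness. This is not the e-TPT used in this study, whose definition is given earlier in the paper; the function name, the scaling factor `gamma`, and the truncation of windows at the end of the series are assumptions made for illustration.

```python
import numpy as np

def square_pen_tpt(x, tau, gamma=1.0):
    """Return (lower, upper) boundaries of a square-pen thick-pen transform.

    Assumed form: upper[t] = max(x[t..t+tau]) + gamma * tau / 2 and
    lower[t] = min(x[t..t+tau]) - gamma * tau / 2.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    lower = np.empty(n)
    upper = np.empty(n)
    for t in range(n):
        window = x[t:t + tau + 1]  # window is shorter near the end of the series
        upper[t] = window.max() + gamma * tau / 2
        lower[t] = window.min() - gamma * tau / 2
    return lower, upper

# Toy zero-inflated series with a single burst of activity.
toy_series = [0, 0, 5, 0, 0, 0]
thin_lower, thin_upper = square_pen_tpt(toy_series, tau=0)
thick_lower, thick_upper = square_pen_tpt(toy_series, tau=2)
```

With `tau=0` the boundaries coincide with the data, whereas with `tau=2` the single burst raises the upper boundary over three consecutive time points, which is the sense in which a thicker pen emphasizes coarser features.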

Fig. 13.

Clustering results for the 67th individual obtained by the proposed method: mean time series of step data for each cluster with pen thickness (a) 20 and (b) 100

Fig. 14.

TPT when thickness is (left) 20 and (right) 100 for a selected day. The black solid line indicates the original step data

Newly Confirmed COVID-19 Case Data

We considered the number of new COVID-19 cases per day in Seoul, Korea, from February 5, 2020, to June 18, 2021, which gives a time series of length 500 for each district. There are 25 districts in Seoul, as depicted in Fig. 15, and the total number of newly confirmed cases in each district during this period is summarized in Table 7. The table also lists the proportion of days with zero confirmed cases, which exceeds 26% in every district. In Jongno-gu, there are zero confirmed cases on more than half of the days, and the highest single-day number of confirmed cases, 81, is observed in Gangseo-gu.

Fig. 15.

Districts of Seoul, Korea

Table 7.

The total number of newly confirmed COVID-19 cases, the number of days with zero confirmed cases, and the maximum confirmed cases per day from February 5, 2020, to June 18, 2021, according to district in Seoul, Korea

District          Total number of cases   Number of zero days (%)   Maximum confirmed cases in a day
Jongno-gu         791                     258 (51.6%)               33
Jung-gu           741                     250 (50.0%)               11
Yongsan-gu        1294                    197 (39.4%)               28
Seongdong-gu      1296                    201 (40.2%)               21
Gwangjin-gu       1570                    218 (43.6%)               25
Dongdaemun-gu     1766                    200 (40.0%)               32
Jungnang-gu       2091                    192 (38.4%)               36
Seongbuk-gu       1957                    183 (36.6%)               39
Gangbuk-gu        1370                    226 (45.2%)               19
Dobong-gu         1467                    187 (37.4%)               22
Nowon-gu          2175                    175 (35.0%)               32
Eunpyeong-gu      2031                    173 (34.6%)               31
Seodaemun-gu      1194                    210 (42.0%)               15
Mapo-gu           1508                    189 (37.8%)               31
Yangcheon-gu      1641                    202 (40.4%)               33
Gangseo-gu        2273                    158 (31.6%)               81
Guro-gu           1565                    192 (38.4%)               40
Geumcheon-gu      794                     249 (49.8%)               14
Yeongdeungpo-gu   1756                    180 (36.0%)               36
Dongjak-gu        1958                    164 (32.8%)               36
Gwanak-gu         2197                    131 (26.2%)               27
Seocho-gu         2058                    159 (31.8%)               37
Gangnam-gu        2852                    144 (28.8%)               38
Songpa-gu         2854                    150 (30.0%)               29
Gangdong-gu       1900                    183 (36.6%)               21
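
The per-district quantities in this table are straightforward to compute from a daily count series. The sketch below first checks that the study window contains 500 daily observations and then summarizes one series; the function name and the toy counts are hypothetical and are not taken from the actual data.

```python
from datetime import date

import numpy as np

# Number of daily observations from February 5, 2020, to June 18, 2021,
# counting both the first and the last day.
n_days = (date(2021, 6, 18) - date(2020, 2, 5)).days + 1

def district_summary(counts):
    """Total cases, number and percentage of zero-case days, and daily maximum."""
    counts = np.asarray(counts)
    zero_days = int(np.sum(counts == 0))
    return {
        "total": int(counts.sum()),
        "zero_days": zero_days,
        "zero_pct": 100.0 * zero_days / counts.size,
        "max_per_day": int(counts.max()),
    }

# Toy series of five days (made-up values, for illustration only).
toy_summary = district_summary([0, 0, 3, 1, 0])
```

For Jongno-gu, for instance, 258 zero-case days out of 500 correspond to the 51.6% reported in the table.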

As illustrated in Fig. 3, the number of new COVID-19 cases per day in each district forms a zero-inflated time series, and we apply the proposed method to cluster the 25 districts based on their time series patterns. The number of cluster groups is set to three, as determined by the gap statistic. Figure 16 presents the clustering results using three different pen thicknesses (τ=10, 30, and 100). The clustering results vary depending on the pen thickness. When τ=10, only one district, Gwanak-gu, is classified as Group 1. Gwanak-gu has the fewest days with zero confirmed cases. Group 2 includes Gangnam-gu and Songpa-gu, the two districts with the highest total numbers of confirmed cases during this period. When τ=100, the proposed method brings out coarser-scale features of the data, and Gangseo-gu, which has the highest single-day number of confirmed cases, is classified as a single group. Figure 17 displays the mean time series of each group, and Table 8 lists summary statistics of the clustering results according to the pen thickness, such as the number of districts, average number of cases, and average rate of zero days. We observe that the levels and patterns of the groups vary according to the pen thickness, and the statistics of the clustering results vary accordingly.

Fig. 16.

Clustering results by the proposed method when τ=10, 30, and 100. Cluster groups are color-coded

Fig. 17.

Mean time series of COVID-19 data for each cluster by the proposed method with τ=10, 30, and 100

Table 8.

Summary statistics for each cluster obtained from the proposed method with pen thicknesses of 10, 30, and 100

Thickness   Statistic                             Cluster 1   Cluster 2   Cluster 3
10          Number of districts                   1           2           22
            Average number of cases per day       4.394       5.706       3.200
            Average percentage of zero days (%)   26.20       29.40       39.51
30          Number of districts                   8           5           12
            Average number of cases per day       2.263       3.450       4.237
            Average percentage of zero days (%)   45.23       35.68       34.50
100         Number of districts                   8           16          1
            Average number of cases per day       2.263       3.972       4.546
            Average percentage of zero days (%)   45.23       35.05       31.60
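
As a consistency check, the τ=10 rows of this table can be recovered from the district-level figures in Table 7, given the group memberships stated in the text (Gwanak-gu alone, Gangnam-gu with Songpa-gu, and the remaining 22 districts). The sketch below assumes that the average number of cases is the mean over districts of total cases divided by the 500 days; the names `TABLE7` and `cluster_summary` are hypothetical.

```python
N_DAYS = 500  # length of each district's daily series

# (total cases, number of zero-case days) per district, copied from Table 7.
TABLE7 = {
    "Jongno-gu": (791, 258), "Jung-gu": (741, 250), "Yongsan-gu": (1294, 197),
    "Seongdong-gu": (1296, 201), "Gwangjin-gu": (1570, 218),
    "Dongdaemun-gu": (1766, 200), "Jungnang-gu": (2091, 192),
    "Seongbuk-gu": (1957, 183), "Gangbuk-gu": (1370, 226),
    "Dobong-gu": (1467, 187), "Nowon-gu": (2175, 175),
    "Eunpyeong-gu": (2031, 173), "Seodaemun-gu": (1194, 210),
    "Mapo-gu": (1508, 189), "Yangcheon-gu": (1641, 202),
    "Gangseo-gu": (2273, 158), "Guro-gu": (1565, 192),
    "Geumcheon-gu": (794, 249), "Yeongdeungpo-gu": (1756, 180),
    "Dongjak-gu": (1958, 164), "Gwanak-gu": (2197, 131),
    "Seocho-gu": (2058, 159), "Gangnam-gu": (2852, 144),
    "Songpa-gu": (2854, 150), "Gangdong-gu": (1900, 183),
}

def cluster_summary(districts):
    """Return (number of districts, mean cases per day, mean % of zero days)."""
    k = len(districts)
    mean_total = sum(TABLE7[d][0] for d in districts) / k
    mean_zero_days = sum(TABLE7[d][1] for d in districts) / k
    return k, mean_total / N_DAYS, 100.0 * mean_zero_days / N_DAYS

# Group memberships for tau = 10 as described in the text.
group1 = ["Gwanak-gu"]
group2 = ["Gangnam-gu", "Songpa-gu"]
group3 = [d for d in TABLE7 if d not in group1 + group2]

for name, members in [("Cluster 1", group1), ("Cluster 2", group2), ("Cluster 3", group3)]:
    k, cases, zero_pct = cluster_summary(members)
    print(f"{name}: {k} districts, {cases:.3f} cases/day, {zero_pct:.2f}% zero days")
```

Under this assumption, the computed values agree with the τ=10 rows of the table after rounding.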

For comparison, we apply the FunFEM, FunHDDC, and DTW methods to the COVID-19 data. The clustering results are provided in Fig. 18. The results of FunFEM and FunHDDC are identical, and some parts are similar to those of the proposed method using a pen thickness of τ=100. The Dunn index and VI for the clustering results are presented in Table 9. The proposed method with τ=10 and DTW show high Dunn indices, while the proposed method with τ=30 yields the highest VI.

Fig. 18.

Clustering results obtained from FunFEM (top left), FunHDDC (top right), and DTW (bottom). Cluster groups are color-coded

Table 9.

Clustering validation measures

Method                    Dunn index   VI
Proposed method (τ=10)    0.534        0.443
Proposed method (τ=30)    0.418        1.039
Proposed method (τ=100)   0.469        0.779
FunFEM                    0.361        0.895
FunHDDC                   0.462        1.021
DTW                       0.538        0.784

Concluding Remarks

In this study, we proposed a novel clustering method that can be applied to high-dimensional zero-inflated time series data. By modifying the TPT, we developed the e-TPT to improve the temporal resolution of zero-inflated time series data and introduced a similarity measure for zero-inflated time series data as an input variable for the clustering algorithm. Furthermore, an efficient iterative clustering algorithm was proposed. Finally, the effectiveness of the proposed method was demonstrated using simulation experiments and real-data analyses with step count data and newly confirmed COVID-19 case data.

Because e-TPT addresses the problem of excessive zeros in zero-inflated data, the proposed method can cluster zero-inflated time series, which are commonly observed in various fields. In addition, the proposed method provides a multiscale view of the data by considering various thicknesses of the e-TPT. A thick pen clusters time series based on the global trend, whereas a thin pen yields cluster groups divided according to the local features of the data. Furthermore, the proposed method addresses missing data issues by utilizing the TPT, which can accommodate missing data when a large thickness is used. Similarly, e-TPT tackles missing data by transforming the raw series into smoothed time series data.

However, the current algorithm requires all time series to have the same length. Future studies could explore how to handle time series of varying lengths. Another issue in the proposed method is finding the optimal pen thickness. Although the CV technique was used for thickness selection in the current study, a data-adaptive choice may improve the clustering performance of the proposed method. This is left for future research.

Funding Information

This study was supported by the National Research Foundation of Korea (NRF) funded by the Korea government (2022R1F1A1074134; 2020R1A4A1018207; 2021R1A2C1091357; 2021R1A2B5B01001790).

Data Availability

Data available on request from the authors.

Declarations

Ethical standard

This article does not contain any studies with human participants performed by any of the authors.

Conflicts of interest

The authors declare that they have no conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Bouveyron C, Côme E, Jacques J. The discriminative functional mixture model for a comparative analysis of bike sharing systems. Annals of Applied Statistics. 2015;9(4):1726–1760. doi: 10.1214/15-AOAS861.
  2. Dunn JC. Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics. 1974;4(1):95–104. doi: 10.1080/01969727408546059.
  3. Fryzlewicz P, Oh H-S. Thick pen transformation for time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2011;73(4):499–529. doi: 10.1111/j.1467-9868.2011.00773.x.
  4. Fryzlewicz P, Ombao H. Consistent classification of nonstationary time series using stochastic wavelet representations. Journal of the American Statistical Association. 2009;104(485):299–312. doi: 10.1198/jasa.2009.0110.
  5. Hartigan JA, Wong MA. Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics). 1979;28(1):100–108.
  6. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218. doi: 10.1007/BF01908075.
  7. Leisch F. A toolbox for K-centroids cluster analysis. Computational Statistics & Data Analysis. 2006;51(2):526–544. doi: 10.1016/j.csda.2005.10.006.
  8. Lim HK, Li WK, Philip L. Zero-inflated Poisson regression mixture model. Computational Statistics & Data Analysis. 2014;71:151–158. doi: 10.1016/j.csda.2013.06.021.
  9. Meilă M. Comparing clusterings–an information based distance. Journal of Multivariate Analysis. 2007;98(5):873–895. doi: 10.1016/j.jmva.2006.11.013.
  10. Pakhira MK, Bandyopadhyay S, Maulik U. Validity index for crisp and fuzzy clusters. Pattern Recognition. 2004;37(3):487–501. doi: 10.1016/j.patcog.2003.06.005.
  11. Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356.
  12. Sarda-Espinosa A. dtwclust: Time series clustering along with optimizations for the dynamic time warping distance. R package version 5.5.12. 2023.
  13. Schmutz A, Jacques J, Bouveyron C, Cheze L, Martin P. Clustering multivariate functional data in group-specific functional subspaces. Computational Statistics. 2018;35(3):1101–1131. doi: 10.1007/s00180-020-00958-4.
  14. Shutaywi M, Kachouie NN. Silhouette analysis for performance evaluation in machine learning with applications to clustering. Entropy. 2021;23(6):759. doi: 10.3390/e23060759.
  15. Tawiah K, Iddrisu WA, Asampana Asosega K. Zero-inflated time series modelling of COVID-19 deaths in Ghana. Journal of Environmental and Public Health. 2021;2021:5543977. doi: 10.1155/2021/5543977.
  16. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001;63:411–423. doi: 10.1111/1467-9868.00293.
  17. Wang W, Lyu G, Shi Y, Liang X. Time series clustering based on dynamic time warping. In: 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS). IEEE; 2018. p. 487–490.
  18. Yau KK, Wang K, Lee AH. Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical Journal. 2003;45(4):437–452. doi: 10.1002/bimj.200390024.
  19. Zhang X, Guo B, Yi N. Zero-inflated Gaussian mixed models for analyzing longitudinal microbiome data. PLoS One. 2020;15(11):e0242073. doi: 10.1371/journal.pone.0242073.

Articles from Journal of Classification are provided here courtesy of Nature Publishing Group
